MBV3070 Bioinformatikk (Bioinformatics). Reading list: Arthur M. Lesk: Introduction to Bioinformatics. Oxford University Press 2002. 270 pages.


1 MBV3070 Bioinformatikk (Bioinformatics)

2 Reading list for MBV3070 Bioinformatikk
Arthur M. Lesk: Introduction to Bioinformatics. Oxford University Press 2002. 270 pages.
In addition:
1. Tom Kristensen: Sekvenssammenstillinger (Sequence alignments). 7 pages.
2. Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research, 22:4673-4680.
3. Higgins, D.G., Thompson, J.D. and Gibson, T.J. (1996) Using CLUSTAL for multiple sequence alignments. Methods in Enzymology, 266:383-402.
4. ??? (Gene finding)
5. ???? (Microarrays)

3 Course schedule
Introduction. Sequencing. Databases. Entrez and SRS.
Dotplots
Pairwise sequence alignment
FASTA and BLAST
Multiple sequence alignment. ClustalW/ClustalX
Motifs, profiles, PSI-BLAST
Phylogeny
Genomes. Analysis of genomic DNA. Gene finding
Microarrays (Ola Myklebost/Ole Chr. Lindgjærde)
Protein modelling (Vincent Eijsink)

4 Useful websites for MBV3070
Course home page: http://www.uio.no/studier/emner/matnat/molbio/MBV3070/v04/
Textbook home page: http://www.oup.com/uk/lesk/bioinf/

5 What is bioinformatics?
The NIH Biomedical Information Science and Technology Initiative Consortium agreed on the following definitions of bioinformatics and computational biology, recognizing that no definition could completely eliminate overlap with other activities or preclude variations in interpretation by different individuals and organizations.
Bioinformatics: Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.
Computational Biology: The development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems.

6 Other ways to define bioinformatics
"The mathematical, statistical and computing methods that aim to solve biological problems using DNA and amino acid sequences and related information." Fredj Tekaia, Institut Pasteur
"The use of computers to store, retrieve, analyze or predict the composition or the structure of biomolecules." Damian Counsell, bioinformatics.org

7 “For the last three and a half billion years, evolution has been taking notes.”
“It tries experiments. It wakes up every morning, does a little mutagenesis, changes a nucleotide here and there, and sees how it works. If it’s a success, it keeps the notes. In this notebook, we have all of the information of the greatest experimental tinkerer ever.”
Dr. Eric Lander, Director of the Whitehead Institute/MIT Center for Genome Research

8 What does this mean?

9 Base symbols
A  Adenine
C  Cytosine
G  Guanine
T  Thymine
U  Uracil
R  Guanine / Adenine (puRine)
Y  Cytosine / Thymine (pYrimidine)
K  Guanine / Thymine (Keto)
M  Adenine / Cytosine (aMino)
S  Guanine / Cytosine (Strong)
W  Adenine / Thymine (Weak)
B  Guanine / Thymine / Cytosine (not A)
D  Guanine / Adenine / Thymine (not C)
H  Adenine / Cytosine / Thymine (not G)
V  Guanine / Cytosine / Adenine (not T)
N  Adenine / Guanine / Cytosine / Thymine (aNy)
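
A symbol table like this one maps directly onto a lookup structure. The following is a minimal Python sketch, not part of the course material: it stores the symbols in a dictionary and turns a pattern containing ambiguity symbols into a regular expression. The names `IUPAC`, `to_regex` and `matches` are hypothetical and chosen only for illustration.

```python
import re

# Nucleotide symbols from the slide, mapped to the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "GA", "Y": "CT", "K": "GT", "M": "AC", "S": "GC", "W": "AT",
    "B": "GTC", "D": "GAT", "H": "ACT", "V": "GCA", "N": "AGCT",
}

def to_regex(pattern):
    """Translate a pattern with ambiguity symbols into a regular expression string."""
    parts = []
    for symbol in pattern.upper():
        bases = IUPAC[symbol]  # raises KeyError for symbols not in the table
        parts.append(bases if len(bases) == 1 else f"[{bases}]")
    return "".join(parts)

def matches(pattern, sequence):
    """True if the whole sequence is compatible with the ambiguous pattern."""
    return re.fullmatch(to_regex(pattern), sequence, re.IGNORECASE) is not None

print(to_regex("GAY"))        # GA[CT]
print(matches("GAY", "gat"))  # True
print(matches("GAY", "gaa"))  # False
```

Character classes keep the sketch short; a real tool would also need to decide how to treat gaps and ambiguity symbols in the sequence being searched.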

10 Why ambiguous symbols?
Sequencing instruments cannot always read the sequence unambiguously.
Ambiguous symbols are also useful in consensus sequences:
Sequence 1  aagcggtaccag
Sequence 2  aaacagcaccaa
Consensus   aarcrgyaccar
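
The consensus on this slide can be reproduced column by column: collect the bases seen at each position and look up the symbol that stands for exactly that set. The sketch below assumes the sequences are already aligned, of equal length, and free of gaps; `SYMBOL_FOR` and `consensus` are illustrative names, not taken from the slides.

```python
# Map each set of observed bases to its ambiguity symbol (DNA only, no gaps).
SYMBOL_FOR = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("GA"): "R", frozenset("CT"): "Y", frozenset("GT"): "K",
    frozenset("AC"): "M", frozenset("GC"): "S", frozenset("AT"): "W",
    frozenset("GTC"): "B", frozenset("GAT"): "D",
    frozenset("ACT"): "H", frozenset("GCA"): "V",
    frozenset("AGCT"): "N",
}

def consensus(*sequences):
    """Column-by-column consensus of aligned DNA sequences of equal length."""
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("sequences must be aligned and of equal length")
    columns = zip(*(s.upper() for s in sequences))
    return "".join(SYMBOL_FOR[frozenset(col)] for col in columns).lower()

print(consensus("aagcggtaccag", "aaacagcaccaa"))  # aarcrgyaccar
```

Note that this kind of consensus records every base observed in a column; it says nothing about how often each base occurs.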

11 The genetic code

12

13 Amino acid symbols
A  Ala  alanine
B  Asx  aspartic acid or asparagine
C  Cys  cysteine
D  Asp  aspartic acid
E  Glu  glutamic acid
F  Phe  phenylalanine
G  Gly  glycine
H  His  histidine
I  Ile  isoleucine
K  Lys  lysine
L  Leu  leucine
M  Met  methionine
N  Asn  asparagine
P  Pro  proline
Q  Gln  glutamine
R  Arg  arginine
S  Ser  serine
T  Thr  threonine
U  Sec  selenocysteine
V  Val  valine
W  Trp  tryptophan
X  Xaa  unknown or 'other' amino acid
Y  Tyr  tyrosine
Z  Glx  glutamic acid or glutamine (or substances such as 4-carboxyglutamic acid and 5-oxoproline that yield glutamic acid on acid hydrolysis of peptides)
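
Converting between the one-letter and three-letter codes is a plain table lookup. The following Python sketch uses the codes listed on this slide; the function names and the hyphen-separated output format are assumptions made for the example.

```python
# One-letter to three-letter codes, as listed on the slide.
ONE_TO_THREE = {
    "A": "Ala", "B": "Asx", "C": "Cys", "D": "Asp", "E": "Glu", "F": "Phe",
    "G": "Gly", "H": "His", "I": "Ile", "K": "Lys", "L": "Leu", "M": "Met",
    "N": "Asn", "P": "Pro", "Q": "Gln", "R": "Arg", "S": "Ser", "T": "Thr",
    "U": "Sec", "V": "Val", "W": "Trp", "X": "Xaa", "Y": "Tyr", "Z": "Glx",
}
# Reverse lookup, built from the same table so the two cannot drift apart.
THREE_TO_ONE = {three: one for one, three in ONE_TO_THREE.items()}

def to_three_letter(protein):
    """'MTYEL' -> 'Met-Thr-Tyr-Glu-Leu' (hyphen-separated, an arbitrary choice)."""
    return "-".join(ONE_TO_THREE[aa] for aa in protein.upper())

def to_one_letter(protein):
    """'Met-Thr-Tyr-Glu-Leu' -> 'MTYEL'."""
    return "".join(THREE_TO_ONE[code.capitalize()] for code in protein.split("-"))

print(to_three_letter("MTYEL"))              # Met-Thr-Tyr-Glu-Leu
print(to_one_letter("Met-Thr-Tyr-Glu-Leu"))  # MTYEL
```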

14 Two ways of sequencing
Shotgun sequencing: the strategy chosen by Celera for its commercial sequencing of the human genome.
Ordered sequencing (top down): the strategy used in the "public" sequencing of the genome, carried out as an international collaboration.

15 Top-down strategy for sequencing

16 Two ways to sequence the genome
BAC to BAC Sequencing: The BAC to BAC approach first creates a crude physical map of the whole genome before sequencing the DNA. Constructing a map requires cutting the chromosomes into large pieces and figuring out the order of these big chunks of DNA before taking a closer look and sequencing all the fragments.
Whole Genome Shotgun Sequencing: The shotgun sequencing method goes straight to the job of decoding, bypassing the need for a physical map. Therefore, it is much faster.

17 Fragmenting the genome
BAC to BAC Sequencing | Whole Genome Shotgun Sequencing

18 Cloning the fragments
BAC to BAC Sequencing | Whole Genome Shotgun Sequencing

19 Placing the BAC clones on the map
BAC to BAC Sequencing | Whole Genome Shotgun Sequencing: this step is not needed in shotgun sequencing

20 Subclones from the BAC clones
BAC to BAC Sequencing | Whole Genome Shotgun Sequencing: this step is not needed in shotgun sequencing

21 Sequencing the clones
BAC to BAC Sequencing | Whole Genome Shotgun Sequencing

22 Raw sequence from a sequencing instrument

23 Building contiguous sequences
BAC to BAC Sequencing | Whole Genome Shotgun Sequencing

24 Assembling individual sequences into larger sequences
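
The idea of joining individual reads into longer contiguous sequences can be illustrated with a toy greedy merge: find the pair of sequences whose ends overlap the most, join them, and repeat. This is only a sketch under strong assumptions (error-free reads, all from the same strand, no repeats); it is not the method used by any particular assembler, and the names `overlap` and `greedy_assemble` are invented for the example. The reads are cut from the first 25 bases of the Listeria ivanovii sequence shown in the GenBank entry in these slides.

```python
def overlap(left, right, min_len=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the largest end overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:  # nothing left to merge
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)

reads = ["gttacatagttc", "cgttatttaagg", "ttaaggtgttac"]
print(greedy_assemble(reads))  # ['cgttatttaaggtgttacatagttc']
```

Real genomes contain repeats longer than a read, which is one reason a greedy merge like this can join the wrong pieces and why a physical map or paired reads help.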

25 DNA sequencing 2001

26 Biological databases
Primary databases (archival)
– GenBank, EMBL, DDBJ, PDB
Secondary databases (curated)
– PIR, SwissProt and everything else

27 Database Categories List
http://www3.oup.co.uk/nar/database/c/
Genomics Databases (non-vertebrate)
Human and other Vertebrate Genomes
Human Genes and Diseases
Metabolic and Signaling Pathways
Microarray Data and other Gene Expression Databases
Nucleotide Sequence Databases
Other Molecular Biology Databases
Protein sequence databases
Proteomics Resources
RNA sequence databases
Structure Databases
In all 548 databases, 162 more than a year earlier

28 GenBank entry
LOCUS       LISOD        756 bp    DNA             BCT       30-JUN-1993
DEFINITION  L.ivanovii sod gene for superoxide dismutase.
ACCESSION   X64011 S78972
NID         g44010
VERSION     X64011.1  GI:44010
KEYWORDS    sod gene; superoxide dismutase.
SOURCE      Listeria ivanovii.
  ORGANISM  Listeria ivanovii
            Bacteria; Firmicutes; Bacillus/Clostridium group; Bacillaceae;
            Listeria.
REFERENCE   1  (bases 1 to 756)
  AUTHORS   Haas,A. and Goebel,W.
  TITLE     Cloning of a superoxide dismutase gene from Listeria ivanovii by
            functional complementation in Escherichia coli and
            characterization of the gene product
  JOURNAL   Mol. Gen. Genet. 231 (2), 313-322 (1992)
  MEDLINE   92140371
REFERENCE   2  (bases 1 to 756)
  AUTHORS   Kreft,J.
  TITLE     Direct Submission
  JOURNAL   Submitted (21-APR-1992) J. Kreft, Institut f. Mikrobiologie,
            Universitaet Wuerzburg, Biozentrum Am Hubland, 8700 Wuerzburg, FRG

29 GenBank entry (cont.)
FEATURES             Location/Qualifiers
     source          1..756
                     /organism="Listeria ivanovii"
                     /strain="ATCC 19119"
                     /db_xref="taxon:1638"
     RBS             95..100
                     /gene="sod"
     gene            95..746
                     /gene="sod"
     CDS             109..717
                     /gene="sod"
                     /EC_number="1.15.1.1"
                     /codon_start=1
                     /transl_table=11
                     /product="superoxide dismutase"
                     /protein_id="CAA45406.1"
                     /db_xref="SWISS-PROT:P28763"
                     /translation="MTYELPKLPYTYD…
     terminator      723..746
                     /gene="sod"
BASE COUNT      247 a    136 c    151 g    222 t
ORIGIN
        1 cgttatttaa ggtgttacat agttctatgg aaatagggtc tatacctttc gccttacaat
       61 gtaatttctt
//
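
A flat-file entry carries some redundancy that can be used as a sanity check: the base counts on the BASE COUNT line should add up to the length given on the LOCUS line. The Python sketch below parses just these two lines of the entry shown here with regular expressions; the helper names are hypothetical, and a complete parser would have to handle far more of the format.

```python
import re

# The two lines of interest, copied from the entry on the slides.
locus_line = "LOCUS       LISOD        756 bp    DNA             BCT       30-JUN-1993"
base_count_line = "BASE COUNT      247 a    136 c    151 g    222 t"

def locus_length(line):
    """Extract the sequence length (the number before 'bp') from a LOCUS line."""
    match = re.search(r"(\d+)\s+bp", line)
    if match is None:
        raise ValueError("no length found in LOCUS line")
    return int(match.group(1))

def base_counts(line):
    """Parse 'BASE COUNT 247 a 136 c ...' into {'a': 247, 'c': 136, ...}."""
    pairs = re.findall(r"(\d+)\s+([a-z]+)", line)
    return {base: int(count) for count, base in pairs}

counts = base_counts(base_count_line)
print(counts)
print(sum(counts.values()) == locus_length(locus_line))  # True: 247+136+151+222 = 756
```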

30 EMBL database entry
EMBL:TRBG361
ID   TRBG361    standard; RNA; PLN; 1859 BP.
XX
AC   X56734; S46826;
XX
SV   X56734.1
XX
DT   12-SEP-1991 (Rel. 29, Created)
DT   15-MAR-1999 (Rel. 59, Last updated, Version 9)
XX
DE   Trifolium repens mRNA for non-cyanogenic beta-glucosidase
XX
KW   beta-glucosidase.
XX
OS   Trifolium repens (white clover)
OC   Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
OC   Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots; Rosidae;
OC   eurosids I; Fabales; Fabaceae; Papilionoideae; Trifolieae; Trifolium.
XX

31 EMBL database entry (cont.)
RN   [5]
RP   1-1859
RX   MEDLINE; 91322517.
RA   Oxtoby E., Dunn M.A., Pancoro A., Hughes M.A.;
RT   "Nucleotide and derived amino acid sequence of the cyanogenic
RT   beta-glucosidase (linamarase) from white clover (Trifolium repens L.).";
RL   Plant Mol. Biol. 17:209-219(1991).
XX
RN   [6]
RP   1-1859
RA   Hughes M.A.;
RT   ;
RL   Submitted (19-NOV-1990) to the EMBL/GenBank/DDBJ databases.
RL   M.A. Hughes, UNIVERSITY OF NEWCASTLE UPON TYNE, MEDICAL SCHOOL, NEW CASTLE
RL   UPON TYNE, NE2 4HH, UK
XX
DR   AGDR; X56734; X56734.
DR   MENDEL; 11000; Trirp;1162;11000.
DR   SWISS-PROT; P26204; BGLS_TRIRP.
XX

32 EMBL database entry (cont.)
FH   Key             Location/Qualifiers
FH
FT   source          1..1859
FT                   /db_xref="taxon:3899"
FT                   /organism="Trifolium repens"
FT                   /tissue_type="leaves"
FT                   /clone_lib="lambda gt10"
FT                   /clone="TRE361"
FT   CDS             14..1495
FT                   /db_xref="SWISS-PROT:P26204"
FT                   /note="non-cyanogenic"
FT                   /EC_number="3.2.1.21"
FT                   /product="beta-glucosidase"
FT                   /protein_id="CAA40058.1"
FT                   /translation="MDFIVAIFALFVISSFTITSTNAVEASTLLDIGNLSRSSFPRGFI
FT                   FGAGSSAYQFEGAVNEGGRGPSIWDTFTHKYPEKIRDGSNADITVDQYHRYKEDVGIMK
FT                   DQNMDSYRFSI….
FT   mRNA            1..1859
FT                   /evidence=EXPERIMENTAL
XX
SQ   Sequence 1859 BP; 609 A; 314 C; 355 G; 581 T; 0 other;
     aaacaaacca aatatggatt ttattgtagc catatttgct ctgtttgtta ttagctcatt        60
     cacaattact tccacaaatg cagttgaagc ttctactctt cttgacatag gtaacctgag       120
     tcggagcagt tttcctcgtg

33 EMBL database fields
Note that each line begins with a two-character line code, which indicates the type of information contained in the line. The currently used line types, along with their respective line codes, are listed below:
ID - identification (begins each entry; 1 per entry)
AC - accession number (>=1 per entry)
SV - new sequence identifier (>=1 per entry)
DT - date (2 per entry)
DE - description (>=1 per entry)
KW - keyword (>=1 per entry)
OS - organism species (>=1 per entry)
OC - organism classification (>=1 per entry)
OG - organelle (0 or 1 per entry)
RN - reference number (>=1 per entry)
RC - reference comment (>=0 per entry)

34 EMBL database fields (cont.)
RP - reference positions (>=1 per entry)
RX - reference cross-reference (>=0 per entry)
RA - reference author(s) (>=1 per entry)
RT - reference title (>=1 per entry)
RL - reference location (>=1 per entry)
DR - database cross-reference (>=0 per entry)
FH - feature table header (0 or 2 per entry)
FT - feature table data (>=0 per entry)
CC - comments or notes (>=0 per entry)
XX - spacer line (many per entry)
SQ - sequence header (1 per entry)
bb - (blanks) sequence data (>=1 per entry)
// - termination line (ends each entry; 1 per entry)
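
Because every line starts with a two-character line code, an entry can be split into fields with very little code. The sketch below groups the lines of one entry by line code, skips the XX spacer lines, and stops at the // termination line. The function name and the "sequence" key used for blank-coded lines are assumptions made for this example, and the sample entry is abbreviated from the one shown in these slides.

```python
from collections import defaultdict

def group_by_line_code(entry):
    """Group the lines of one EMBL flat-file entry by their two-character line code."""
    fields = defaultdict(list)
    for line in entry.splitlines():
        code, content = line[:2], line[2:].strip()
        if code == "//":                      # termination line ends the entry
            break
        if code == "XX" or not line.strip():  # spacer lines carry no data
            continue
        if code == "  ":                      # blank line code: sequence data
            code = "sequence"
        fields[code].append(content)
    return dict(fields)

entry = """\
ID   TRBG361    standard; RNA; PLN; 1859 BP.
XX
AC   X56734; S46826;
XX
DE   Trifolium repens mRNA for non-cyanogenic beta-glucosidase
XX
OS   Trifolium repens (white clover)
//
"""

fields = group_by_line_code(entry)
print(fields["AC"])  # ['X56734; S46826;']
print(fields["OS"])  # ['Trifolium repens (white clover)']
```

Keeping a list per line code preserves multi-line fields such as OC or RL in their original order.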

35 The feature table
The overall goal of the feature table design is to provide an extensive vocabulary for describing features in a flexible framework for manipulating them. The Feature Table documentation represents the shared rules that allow the three databases to exchange data on a daily basis.
The range of features to be represented is diverse, including regions which:
– perform a biological function,
– affect or are the result of the expression of a biological function,
– interact with other molecules,
– affect replication of a sequence,
– affect or are the result of recombination of different sequences,
– are a recognizable repeated unit,
– have secondary or tertiary structure,
– exhibit variation, or
– have been revised or corrected.

36 Feature table terminology
The format and wording in the feature table use common biological research terminology whenever possible. For example, an item in the new feature table such as:
Key             Location/Qualifiers
CDS             23..400
                /product="alcohol dehydrogenase"
                /gene="adhI"
might be read as: The feature CDS is a coding sequence beginning at base 23 and ending at base 400, has a product called 'alcohol dehydrogenase' and corresponds to the gene called 'adhI'.

37 Feature table terminology (cont.)
A more complex description:
Key             Location/Qualifiers
CDS             join(544..589,688..1032)
                /product="T-cell receptor beta-chain"
                /partial
which might be read as: This feature, which is a partial coding sequence, is formed by joining the indicated elements to form one contiguous sequence encoding a product called T-cell receptor beta-chain.
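
The two location forms used in these examples, a simple range and a join of ranges, are easy to parse. The Python sketch below handles only those two forms; other operators of the full feature table syntax (for example complement or partial-end markers) are deliberately out of scope, and the function names are invented for the illustration.

```python
import re

def parse_location(location):
    """Parse 'a..b' or 'join(a..b,c..d,...)' into a list of (start, end) pairs."""
    text = location.replace(" ", "")
    inner = re.fullmatch(r"join\((.*)\)", text)
    spans = inner.group(1).split(",") if inner else [text]
    result = []
    for span in spans:
        match = re.fullmatch(r"(\d+)\.\.(\d+)", span)
        if match is None:
            raise ValueError(f"unsupported location: {span!r}")
        result.append((int(match.group(1)), int(match.group(2))))
    return result

def location_length(location):
    """Total number of bases covered (positions are 1-based and inclusive)."""
    return sum(end - start + 1 for start, end in parse_location(location))

print(parse_location("23..400"))                    # [(23, 400)]
print(location_length("23..400"))                   # 378
print(parse_location("join(544..589,688..1032)"))   # [(544, 589), (688, 1032)]
print(location_length("join(544..589,688..1032)"))  # 391
```

The first CDS covers 378 bases, a whole number of codons (126). The joined one covers 391 bases, which is not a multiple of three, consistent with its /partial qualifier.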

38 Feature key examples
Key            Description
conflict       Separate determinations of the "same" sequence differ
rep_origin     Origin of replication
protein_bind   Protein binding site on DNA
CDS            Protein-coding sequence
misc_RNA       Generic label for an undefined RNA
insertion_seq  Insertion element
D-loop         Mitochondrial or other D-loop structure

