Computational Biology Dr. Jens Allmer Lecture Slides Week 3.

1 Computational Biology Dr. Jens Allmer Lecture Slides Week 3

2 MBG404 Overview Data Generation Processing Storage Mining Pipelining

3 Initialization Files There are many types of initialization files. Often a plain text format is used, such as –Name=value or Name:value –Entries are usually separated by line breaks. Commenting out parts of the information is nearly always supported (here with ;). XML and JSON are becoming more and more popular in this area –Many programs adopt JSON- or XML-formatted ini files. Example: Name=value
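As a minimal sketch of the Name=value style described on this slide, the snippet below reads such a file with Python's standard `configparser` module. The section name `search` and the keys are hypothetical; `configparser` expects at least one `[section]` header, which the slide does not mention.

```python
import configparser

# Hypothetical ini content: ';' starts a comment, '=' and ':' both separate
# a name from its value.
INI_TEXT = """\
; settings for a made-up search tool
[search]
name = value
database: proteins.fasta
"""

def read_ini(text):
    """Parse Name=value / Name:value pairs into a dict of sections."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {section: dict(parser[section]) for section in parser.sections()}

if __name__ == "__main__":
    settings = read_ini(INI_TEXT)
    print(settings["search"]["database"])
```

Note that `configparser` lowercases the names by default, so lookups should use lowercase keys.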

4 File Formats FASTA –William R. Pearson –Program Suite –File Format Consists of –Definition line (starts with ‘>’) –Ends with line break or carriage return or both –Following the definition line are the sequences –ONLY IUPAC characters Nucleotides Amino Acids –Ends with Next > End of file
Example:
>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
HFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYCK
MDWFLNYLNNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWLETISKK
TYAPPREGHLECTSTVTGMTVELNYIPKNRTNVTLSPQIESIWAAELDRYKLVEITPIGF
APTEVRRYTGGHERQKRVPFVXXXXXXXXXXXXXXXXXXXXXXVQSQHLLAGILQQQKNL
LAAVEAQQQMLKLTIWGVK
Identifier formats:
GenBank: gi|gi-number|gb|accession|locus
EMBL Data Library: gi|gi-number|emb|accession|locus
DDBJ, DNA Database of Japan: gi|gi-number|dbj|accession|locus
NBRF PIR: pir||entry
Protein Research Foundation: prf||name
SWISS-PROT: sp|accession|name
Brookhaven Protein Data Bank: pdb|entry|chain
Brookhaven Protein Data Bank: entry:chain|PDBID|CHAIN|SEQUENCE
Patents: pat|country|number
GenInfo Backbone Id: bbs|number
General database identifier: gnl|database|identifier
NCBI Reference Sequence: ref|accession|locus
Local Sequence identifier: lcl|identifier
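The record structure described on this slide (a definition line starting with ‘>’, followed by sequence lines until the next ‘>’ or the end of the file) can be read with a few lines of Python. This is only a sketch; the function name `parse_fasta` and the short test records are made up for illustration.

```python
def parse_fasta(text):
    """Return a list of (definition_line, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():  # splitlines copes with \n, \r\n and \r
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new definition line closes the previous record.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    # The end of the file closes the last record.
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

# Hypothetical two-record input.
EXAMPLE = ">seq1 first test protein\nELRLRYCAPAG\nFALLKCNDAD\n>seq2\nMKV\n"

if __name__ == "__main__":
    for definition, sequence in parse_fasta(EXAMPLE):
        print(definition, len(sequence))
```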

5 FASTA Soft Rules –Depend on the implementation of a specific program –No space directly after ‘>’ –Lines should not contain more than 80 characters –Sequence lines may not contain whitespace (space, tabulator, other control characters) –May only contain IUPAC nucleotides and amino acids Note: often all foreign characters are simply ignored
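The soft rules can be turned into a small checker. The sketch below is not the FASTAValidator used in the practice session; it assumes the 80-character limit applies to each line, and it keeps separate nucleotide and amino acid alphabets because the union of both IUPAC alphabets covers every letter from A to Z.

```python
# IUPAC one-letter codes, including ambiguity codes.
NUCLEOTIDES = set("ACGTURYSWKMBDHVN")
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWYBZXJUO")

def check_fasta_lines(lines, alphabet=AMINO_ACIDS, max_len=80):
    """Return (line_number, message) warnings for the soft rules."""
    warnings = []
    for number, line in enumerate(lines, start=1):
        if line.startswith(">"):
            if line[1:2].isspace():
                warnings.append((number, "whitespace directly after '>'"))
            continue
        if len(line) > max_len:
            warnings.append((number, f"line longer than {max_len} characters"))
        if any(ch.isspace() for ch in line):
            warnings.append((number, "whitespace inside sequence line"))
        foreign = sorted({ch for ch in line.upper() if not ch.isspace()} - alphabet)
        if foreign:
            warnings.append((number, "foreign characters: " + "".join(foreign)))
    return warnings

if __name__ == "__main__":
    for warning in check_fasta_lines(["> bad header", "ACDE FG", "ACD1"]):
        print(warning)
```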

6 eXtensible Markup Language Aims to describe data Each item in an XML file can be clearly defined and understood Can be validated Is humanly readable Could be used to invent HTML Can be used instead of HTML Contains no formatting instructions (HTML, CSS, XSLT)

7 XML A tag –Is a word enclosed in angle brackets ‘<>’ – An element consists of an opening and a closing tag – opening tag – closing tag (note slash ‘/’) An element can be abbreviated if it contains no further elements – open and closed in one statement An element may contain any number of elements An element may contain text information Some important information about tags

8 XML An element can contain attributes which describe the element further – The name of an element should describe its content – –Should contain.. Of course.. Sequences Children of an element should be related to the parent Attributes should extend the information logically THEACTUALSEQUENCETHATISDESCRIBEDINTHISELEMENT

9 XML Elements and Attributes can be used interchangeably gi|532319|pir|TVFV2E|TVFV2E envelope protein 35 THEACTUALSEQUENCETHATISDESCRIBEDINTHISELEMENT Attributes and Elements can be mixed If the elements are properly nested and the tags are appropriately opened, closed and matched and the attributes are properly quoted... –Then the XML is well formed –It is not necessarily valid
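Well-formedness can be checked by simply trying to parse the text. The sketch below uses Python's standard `xml.etree.ElementTree`; it only tests nesting, matching tags and quoted attributes, and does not validate against a DTD or schema. The element names in the usage lines are illustrative.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """True if the parser accepts the text, i.e. the XML is well formed."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

if __name__ == "__main__":
    # Properly nested, matched and quoted.
    print(is_well_formed('<Sequence length="35">ACGT</Sequence>'))
    # Closing tags in the wrong order: not well formed.
    print(is_well_formed('<Sequence><Identifier>x</Sequence></Identifier>'))
```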

10 XML Validation Data can be precisely described Only elements defined may be used in an XML file if it is to be valid Validation styles –Document Type Definitions DTD –XML Schema Validation ensures that only expected data is in the document Program can then ensure that they understand the document

11 An XML File Declaration for the consumer of the XML file – A root element which will contain the complete data needs to be defined – An XML file is like a tree consisting of –Roots? No.. 1 root –Trunk elements immediately following the root –Branches (Elements containing further elements [nested]) –Leaves (Elements with no contained elements) Where are the attributes?

12 XML File Example gi|532319|pir|TVFV2E|TVFV2E envelope protein 35 THEACTUALSEQUENCETHATISDESCRIBEDINTHISELEMENT gi|732413|pir|PDP22E|PDP22E predicted protein 35 ANOTHERSEQUENCETHATMAYBERECORDEDSOMEWHERE
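The markup of this example did not survive in the transcript, so the following is only a sketch of how such a file could be generated. The element names `Sequence`, `Identifier` and `Length` are taken from the CSS selectors on the styling slide; the root element `Sequences` and the element `Residues` are assumptions.

```python
import xml.etree.ElementTree as ET

def build_sequences_xml(records):
    """Build one root element with a Sequence child per (identifier, residues)."""
    root = ET.Element("Sequences")  # exactly one root
    for identifier, residues in records:
        seq = ET.SubElement(root, "Sequence")  # branch
        ET.SubElement(seq, "Identifier").text = identifier  # leaf
        ET.SubElement(seq, "Length").text = str(len(residues))  # leaf
        ET.SubElement(seq, "Residues").text = residues  # leaf
    return ET.tostring(root, encoding="unicode")

if __name__ == "__main__":
    print(build_sequences_xml([
        ("gi|532319|pir|TVFV2E|TVFV2E envelope protein", "ELRLRYCAPAG"),
        ("gi|732413|pir|PDP22E|PDP22E predicted protein", "MKV"),
    ]))
```

Computing the length from the residues, rather than typing it by hand, keeps the two values from drifting apart.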

13 Styling of XML (4get HTML) Cascading Stylesheets (CSS) Define styles for the elements that shall be displayed Sequence { font-family: Courier,monospace; background-color:#DDDDDD; display:block; border-bottom: 1px outset green; } Identifier { font-family: Arial,Helvetica; font-size: 1.2em; font-weight: bold; } Length { display:none; } Connecting XML and CSS – –Add at the top of the file just after version information

14 Connecting to an XML Schema The schema or definition is linked in the root element

15 JavaScript Object Notation JSON models objects An object is enclosed in curly braces ‘{ }’ An array is enclosed in square brackets ‘[ ]’ Pairs –"Name" : "Value" –Separated by commas Example {"Sequence" : "ACGCTAGCCGCATCGTAGC"} {"Sequence" : "ACGCTAGCCGCATCGTAGC", "Identifier" : "gi|532319|pir|TVFV2E|TVFV2E envelope protein", "length" : 19 } {"Sequences" : [ {"Sequence" : "ACGCTAGCCGCATCGTAGC", "Identifier" : "gi|532319|pir|TVFV2E|TVFV2E envelope protein", "length" : 19 }, {"Sequence" : "ACGCTAACGCTGATCAGCGGCATCGCCGCATCGTAGC", "Identifier" : "gi|764319|pir|AVBB2E|AVBB2E predicted protein", "length" : 37 } ] }
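A short sketch of reading such a document with Python's standard `json` module. The first record reuses the sequence from the slide; the second record is hypothetical and carries a deliberately wrong `length` to show why a stored length is worth checking against the sequence itself.

```python
import json

JSON_TEXT = """
{"Sequences": [
  {"Sequence": "ACGCTAGCCGCATCGTAGC",
   "Identifier": "gi|532319|pir|TVFV2E|TVFV2E envelope protein",
   "length": 19},
  {"Sequence": "ACGT",
   "Identifier": "hypothetical|entry|with wrong length",
   "length": 5}
]}
"""

def load_sequences(text):
    """Return the array stored under the 'Sequences' name as a list of dicts."""
    return json.loads(text)["Sequences"]

def length_mismatches(text):
    """Identifiers whose 'length' value disagrees with the actual sequence."""
    return [entry["Identifier"] for entry in load_sequences(text)
            if entry["length"] != len(entry["Sequence"])]

if __name__ == "__main__":
    print(length_mismatches(JSON_TEXT))
```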

16 XML - JSON JSON is like XML because: –They are both 'self-describing', meaning that values are named, and thus 'human readable' –Both are hierarchical (i.e. you can have values within values) –Both can be parsed and used by lots of programming languages –Both can be passed around using AJAX (i.e. XMLHttpRequest) JSON is UNlike XML because: –XML uses angle brackets, with a tag name at the start and end of an element; JSON uses curly braces with the name only at the beginning of the element –JSON is less verbose, so it is quicker for humans to write and probably quicker for us to read –JSON can be parsed in JavaScript with JSON.parse(); eval() also works but executes arbitrary code and should not be used on untrusted input –JSON includes arrays (where each element doesn't have a name of its own) –In XML, element names must follow the XML naming rules; in JSON, names are quoted strings, so reserved words from JavaScript only cause trouble when the text is evaluated as JavaScript source

17 ASN.1 Format

18 End of Theory I 5 min Mindmapping 10 min break

19 Practice I

20 File Formats FASTA –Create a FASTA formatted file –Go to mbg403.allmer.de/tools –Choose FASTAValidator and check if your FASTA file is correct.

21 File Formats XML –Write an XML file which will model a sequence similar to FASTA

22 Validation Validate the instance of the XML file that you wrote Copy and paste your data in one of those forms and see whether your XML –is well formed –is valid

23 Style the File Use the provided CSS and XML file as a reference Change the styling such that it is pleasant for you to view

24 U Design Write an XML file which models a person No validation possible unless you can create the schema as well (not part of this course) Add information about yourself in the file Style the file using CSS –http://www.w3.org/Style/CSS/ –Example from mbg403.allmer.de Present your creation

25 Theory II

26 Sequence Alignment Exact –simple Approximate –More difficult target pattern target pattern

27 Sequence Alignment Exact pattern matching –Naive method aligns pattern with each location of the target –Boyer-Moore indexes the pattern to skip some alignments –Wu-Manber indexes many patterns and skips some alignments –Indexing Suffix tree indexes target and then quickly finds each pattern Many other methods
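The naive method from this slide, written out as a sketch: the pattern is aligned with every position of the target and compared directly. Boyer-Moore, Wu-Manber and suffix trees find the same hits but skip alignments or index the input; they are not shown here.

```python
def naive_search(target, pattern):
    """Return every start position where pattern occurs exactly in target."""
    if not pattern:
        return []
    hits = []
    # Align the pattern with each location of the target in turn.
    for start in range(len(target) - len(pattern) + 1):
        if target[start:start + len(pattern)] == pattern:
            hits.append(start)
    return hits

if __name__ == "__main__":
    print(naive_search("ACGCTAGCCGCATCGTAGC", "TAG"))
```

In the worst case this compares len(target) × len(pattern) characters, which is what the indexing methods are designed to avoid.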

28 Sequence Alignment Approximate pattern matching –Pairwise Local –Smith-Waterman –BLAST –FASTA Global –Needleman-Wunsch –Multiple T-Coffee ClustalW...
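To make local alignment concrete, here is a minimal sketch of the Smith-Waterman scoring recurrence with a linear gap penalty. The scoring values (match 2, mismatch -1, gap -1) are arbitrary assumptions; real tools use substitution matrices and affine gaps, and also trace back the alignment itself, which is omitted here.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between a and b (score only, no traceback)."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            score = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            current.append(score)
            best = max(best, score)
        previous = current
    return best

if __name__ == "__main__":
    print(smith_waterman_score("GGACGTT", "ACG"))
```

Dropping the floor at 0 and scoring the full length of both sequences would turn this into the global (Needleman-Wunsch) variant.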

29 Basic Local Alignment Search Tool Input –Pattern –Target –Search parameters and settings Output –Alignments in various formats XML Help –http://www.ncbi.nlm.nih.gov/books/NBK1763/

30 BLAST Target –Needs to be indexed –Cannot be FASTA –Must fit to the pattern and BLAST variant protein target and protein pattern can be searched using blastp Target indexing –makeblastdb, in the BLAST package can index FASTA files –Needs sequence input (e.g. FASTA, asn.1) –Needs sequence type to be provided e.g.: protein

31 BLAST blastp –Needs indexed database –Needs query sequence (can be unindexed FASTA) –Produces alignments

32 32

33 Blast flavors –BlastN: nt versus nt database –BlastP: protein versus protein database –BlastX: translated nt (6 frames) versus protein database –tBlastN: protein versus translated nt database (6 frames) –tBlastX: translated nt versus translated nt database (both 6 frames) [Figure: query (DNA or protein) versus database (DNA or protein)]
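The flavor table can be expressed as a simple lookup from query type and database type to the matching programs. This is only an illustration of the slide's table; the function name and the type labels `nucleotide` and `protein` are made up.

```python
# (query type, database type) -> BLAST programs that accept this combination
BLAST_FLAVORS = {
    ("nucleotide", "nucleotide"): ["blastn", "tblastx"],
    ("protein", "protein"): ["blastp"],
    ("nucleotide", "protein"): ["blastx"],
    ("protein", "nucleotide"): ["tblastn"],
}

def programs_for(query_type, db_type):
    """Return the BLAST flavors usable for the given query and database."""
    try:
        return BLAST_FLAVORS[(query_type, db_type)]
    except KeyError:
        raise ValueError(f"unknown combination: {query_type} vs {db_type}")

if __name__ == "__main__":
    print(programs_for("protein", "nucleotide"))
```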

34 BLAST Output XML –-outfmt 5 This switch leads to XML output
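Once BLAST has written XML with `-outfmt 5`, the hits can be read with any XML parser. The sketch below uses `xml.etree.ElementTree` on a heavily abbreviated, hand-written sample; real output contains many more elements, and the hit shown is invented.

```python
import xml.etree.ElementTree as ET

# Abbreviated, hand-written stand-in for a BLAST XML report.
SAMPLE = """<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_hits>
        <Hit>
          <Hit_def>TVFV2E envelope protein</Hit_def>
          <Hit_hsps>
            <Hsp><Hsp_evalue>1e-50</Hsp_evalue></Hsp>
            <Hsp><Hsp_evalue>0.002</Hsp_evalue></Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>"""

def best_hits(xml_text):
    """Return (hit description, smallest e-value) for every hit in the report."""
    root = ET.fromstring(xml_text)
    results = []
    for hit in root.iter("Hit"):
        evalues = [float(hsp.findtext("Hsp_evalue")) for hsp in hit.iter("Hsp")]
        if evalues:
            results.append((hit.findtext("Hit_def"), min(evalues)))
    return results

if __name__ == "__main__":
    print(best_hits(SAMPLE))
```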

35 Download Blast –http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download –Get blastp and makeblastdb from mbg404 since you are not allowed to install anything Download a Fasta file (protein, genome, collection of sequences in fasta format) –Database must consist of amino acids since we only have access to blastp today Use makeblastdb from the Blast package to index the file Several files will be created when you do it right

36 MakeDB Example –makeblastdb -in seq.fasta -dbtype prot -out seqBl -title seqBlastDB More information? –Go to the doc folder of BLAST –Documentation is there –http://www.ncbi.nlm.nih.gov/books/NBK1763/

37 BLAST Now that we have an indexed database try to run BLAST Read documentation and try to solve the simplest case –You will need the indexed database and you will need a FASTA file as query –You could create queries from the database and slightly change them Good luck
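One possible way to script the two steps (indexing, then searching) is sketched below. The makeblastdb options are the ones from the MakeDB slide; the blastp options `-query`, `-db`, `-out` and `-outfmt` are standard BLAST+ switches. The file names `query.fasta` and `result.xml` are hypothetical, and the programs are only run if they are found on the PATH.

```python
import shutil
import subprocess

def makeblastdb_command(fasta, out, title, dbtype="prot"):
    """Command line that indexes a FASTA file as a BLAST database."""
    return ["makeblastdb", "-in", fasta, "-dbtype", dbtype,
            "-out", out, "-title", title]

def blastp_command(query, db, out_xml):
    """Command line that searches a protein query against an indexed database."""
    return ["blastp", "-query", query, "-db", db,
            "-outfmt", "5", "-out", out_xml]  # 5 = XML output

def run(command):
    """Run a command, failing early if the program is not installed."""
    if shutil.which(command[0]) is None:
        raise FileNotFoundError(f"{command[0]} not found on PATH")
    subprocess.run(command, check=True)

if __name__ == "__main__":
    print(" ".join(makeblastdb_command("seq.fasta", "seqBl", "seqBlastDB")))
    print(" ".join(blastp_command("query.fasta", "seqBl", "result.xml")))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.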

38 Theory II

39 Mass Spectra Recording (e.g. Triple Play)

40 Fragmentation Spectrum

41 MS/MS spectra MS/MS spectra can be assigned a peptide sequence (PSM) –Database search –De novo sequencing

42 PepNovo Performs de novo sequencing of MS/MS spectra Takes a single spectrum as input Needs a mathematical model for its evaluation Will display the results in the console You will therefore need to redirect the output Example –?>PepNovo.exe -dta MSMSSpectrum.dta -model tryp_model.txt

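43 De Novo Sequencing [Figure: fragmentation spectrum annotated with the sequences KAIAQLEEDYL and LYDEELQAIAK; peak spacings of ~115.02 and ~129.1 are read as the residues D and E]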
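The idea behind the figure, reading residues from the mass differences between neighbouring fragment peaks, can be sketched in a few lines. The table holds standard monoisotopic residue masses; the tolerance and the example peak list are assumptions, and real de novo tools such as PepNovo score many competing interpretations instead of reading one clean ladder.

```python
# Monoisotopic residue masses in Dalton (L and I are isobaric).
RESIDUE_MASSES = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}

def residues_for_gap(delta, tolerance=0.02):
    """Residues whose mass matches a peak-to-peak difference."""
    return sorted(aa for aa, mass in RESIDUE_MASSES.items()
                  if abs(mass - delta) <= tolerance)

def read_ladder(peaks, tolerance=0.02):
    """Translate consecutive peak differences into a sequence string."""
    ordered = sorted(peaks)
    letters = []
    for low, high in zip(ordered, ordered[1:]):
        candidates = residues_for_gap(high - low, tolerance)
        if len(candidates) == 1:
            letters.append(candidates[0])
        elif candidates:
            letters.append("[" + "/".join(candidates) + "]")  # ambiguous
        else:
            letters.append("?")  # no single residue explains the gap
    return "".join(letters)

if __name__ == "__main__":
    # Hypothetical ladder: gaps of about 115.03 (D) and 129.04 (E).
    print(read_ladder([100.0, 215.027, 344.07]))
```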

44 MS/MS spectra MS/MS spectra can be assigned a peptide sequence (PSM) –Database search –De novo sequencing

45 Correlation Database selection >1080ZR IAAYPGVSPGLMIHYNIGR >1137RZ AAYPGATQPGATELARRLGK >1152RZ GSGDAAYPGGPFFNLFNLGK >1152ZR GSGDAAYPGGPFFNLFNLGK >2360RZ VDSGWGGVVVVALAPYNLGR >240RZ HPGVVCRPGRGGGCSRHIGK HPGVVCCSRHRRSHTIGK

46 Initialization Files X!Tandem –Taxonomy.xml –Default_Input.xml –Input.xml Running X!Tandem –?>tandem.exe input.xml That was easy –But behold, what about the input?

47 Taxonomy XML

48 Input.xml Each one of the parameters for x! tandem is entered as a labeled note node. Any of the entries in the default_input.xml file can be over-ridden by adding a corresponding entry to this file. This file represents a minimum input file, with only entries for the default settings, the output file and the input spectra file name. See the taxonomy.xml file for a description of how FASTA sequence list files are linked to a taxon name. default_input.xml taxonomy.xml chlamy test_spectra.mgf output.xml Personally, I don’t approve of the XML used here Another input file
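The XML markup of this slide was lost in the transcript. The sketch below therefore assumes the usual X! Tandem layout, a `bioml` root with `note` elements carrying `type="input"` and a `label` attribute; the label strings are assumptions to be checked against the files shipped with X! Tandem, while the values are the ones visible on the slide. It also shows how entries in input.xml override the defaults.

```python
import xml.etree.ElementTree as ET

# Assumed layout; label names should be verified against the real files.
INPUT_XML = """<bioml>
  <note>Free-text notes without a label are comments.</note>
  <note type="input" label="list path, default parameters">default_input.xml</note>
  <note type="input" label="list path, taxonomy information">taxonomy.xml</note>
  <note type="input" label="protein, taxon">chlamy</note>
  <note type="input" label="spectrum, path">test_spectra.mgf</note>
  <note type="input" label="output, path">output.xml</note>
</bioml>"""

def read_notes(xml_text):
    """Collect label -> value for every labeled input note."""
    root = ET.fromstring(xml_text)
    return {note.get("label"): (note.text or "").strip()
            for note in root.iter("note")
            if note.get("type") == "input" and note.get("label")}

def effective_parameters(defaults, overrides):
    """Entries from input.xml replace entries of the same label in the defaults."""
    return {**defaults, **overrides}

if __name__ == "__main__":
    for label, value in read_notes(INPUT_XML).items():
        print(f"{label} = {value}")
```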

49 Default-input XML list path parameters default_input.xml This value is ignored when it is present in the default parameter list path. taxonomy.xml spectrum parameters yes Daltons The value for this parameter may be 'Daltons' or 'ppm': all other values are ignored ppm The value for this parameter may be 'Daltons' or 'ppm': all other values are ignored monoisotopic values are monoisotopic|average spectrum conditioning parameters The peaks read in are normalized so that the most intense peak is set to the dynamic range value. All peaks with values of less than 1, using this normalization, are not used. This normalization has the overall effect of setting a threshold value for peak intensities. 50 If this value is 0, it is ignored. If it is greater than zero (let's say 50), then the number of peaks in the spectrum will be limited to the 50 most intense peaks in the spectrum. X! tandem does not do any peak finding: it only limits the peaks used by this parameter, and the dynamic range parameter. 4 yes residue modification parameters
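The two conditioning steps described in these notes (dynamic range normalization, then a cap on the number of peaks) can be sketched as follows. This is not X! Tandem's own code; the function name is made up, and the dynamic range of 100 is an assumption because the value is not visible in the transcript.

```python
def condition_spectrum(peaks, dynamic_range=100.0, total_peaks=50):
    """peaks is a list of (m/z, intensity); returns the conditioned peaks."""
    if not peaks:
        return []
    top = max(intensity for _, intensity in peaks)
    if top <= 0:
        return []
    # Scale so that the most intense peak equals the dynamic range value.
    scaled = [(mz, intensity * dynamic_range / top) for mz, intensity in peaks]
    # Peaks below 1 after normalization are not used.
    kept = [peak for peak in scaled if peak[1] >= 1.0]
    # A total_peaks value of 0 means "no limit".
    if total_peaks > 0:
        kept = sorted(kept, key=lambda peak: peak[1], reverse=True)[:total_peaks]
    return sorted(kept)  # back in m/z order

if __name__ == "__main__":
    spectrum = [(100.0, 1000.0), (200.0, 5.0), (300.0, 500.0), (400.0, 20.0)]
    print(condition_spectrum(spectrum, dynamic_range=100.0, total_peaks=2))
```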

50 Beautifying XML XML –Only describes data Formatting of XML –Additional files can be linked to beautify the display Transformation (XSLT) –Translates XML into HTML XML Styling (CSS) –Describes formatting to the elements and attributes used in the XML file Both files need to be linked at the beginning of the XML file

51 XML What is an element? What is an attribute? Design a Person –What are attributes of a person? –Use elements for logical grouping –Use attributes for specific information

52 Styling Connect the example style Nothing will be styled ;) Examine the CSS file and rename the styles such that your person XML will be somewhat styled

53 End Theory II 5 min mindmapping 10 min break

54 Practice II

55 View Spectra and Sequence To view matching peaks of the PepNovo prediction and the spectrum at the same time Use the DtaViewer from

56 Download Download PepNovo –http://www-cse.ucsd.edu/groups/bioinformatics/software.html#pepnovo –http://bioinformatics.allmer.de/tools Download test file –http://bioinformatics.allmer.de/tools

57 Try PepNovo Try to run PepNovo –Use the given input –Use the help information –Use the lecture slides –Use the lecture notes Aim –Store the result in a text file

58 PepNovo Results are displayed in the console We need to redirect the output into a file. ?>PepNovo.exe -dta MSMSSpectrum.dta -model tryp_model.txt > result.txt
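The same redirection can be done from a script instead of the shell. The sketch below sends a program's console output to a file with Python's `subprocess` module; the PepNovo arguments are the ones from the slide, and the helper name `run_and_redirect` is made up.

```python
import subprocess

# Command from the slide, written as an argument list.
PEPNOVO_COMMAND = ["PepNovo.exe", "-dta", "MSMSSpectrum.dta",
                   "-model", "tryp_model.txt"]

def run_and_redirect(command, out_path):
    """Run command and write everything it prints to stdout into out_path."""
    with open(out_path, "w") as handle:
        completed = subprocess.run(command, stdout=handle, check=False)
    return completed.returncode

if __name__ == "__main__":
    try:
        # Equivalent to: PepNovo.exe -dta ... -model ... > result.txt
        run_and_redirect(PEPNOVO_COMMAND, "result.txt")
    except FileNotFoundError:
        print("PepNovo.exe was not found in the current directory or PATH")
```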

59 X!Tandem Unzip folder and check Mgf formatted spectra (file) Database file (FASTA) tandem-win folder Used .xml configuration files (default_input.xml, input.xml and taxonomy.xml) To get the same output given in zip folder: –Replace configuration files in «tandem-win\bin» folder with ones in «used» folder. –Also copy database file to «fasta» folder and .mgf file to «bin» in «tandem-win»

60 X!Tandem Console Application

61 X!Tandem Default Input Parameters such as mass tolerances, enzyme type and number of charges for the search can be reset in default_input.xml

62 X!Tandem Input.xml In the input.xml file, you should specify the paths of: taxonomy.xml, default_input.xml, the spectra file and the output file. NOTE: Here input.xml and all files above are in the same folder (directory).

63 X!Tandem Taxonomy In the taxonomy file, you should specify the database file path. In this example, the database file is in the «fasta» folder inside the «Xtandem\tandem-win» folder.

64 X!Tandem Output

