Computational Biology


1 Computational Biology
Dr. Jens Allmer Lecture Slides Week 3

2 MBG404 Overview Processing Pipelining Generation Data Storage Mining

3 Initialization Files There are many types of initialization files
Often a plain text format is used, such as Name=value or Name:value. Entries are usually separated by line breaks. Commenting out parts of the information is nearly always supported (here with ';'). XML and JSON are becoming more and more popular in this area: many programs adopt JSON- or XML-formatted ini files.
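A minimal sketch of reading such a plain text initialization file, assuming Python's standard-library configparser module. The section name [blast] and both keys are made up for illustration; configparser requires a section header, which the slide does not show.

```python
import configparser

# Hypothetical ini content: one '=' entry, one ':' entry, one ';' comment.
INI_TEXT = """
; lines starting with ';' are comments
[blast]
database = seqBl
evalue: 0.001
"""

def read_settings(text):
    """Parse ini text into a {section: {name: value}} dictionary."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {section: dict(parser[section]) for section in parser.sections()}

settings = read_settings(INI_TEXT)
print(settings)
```

All values come back as strings, so numeric settings such as evalue still have to be converted by the program that uses them.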

4 File Formats FASTA Consists of William R. Pearson Program Suite
Definition line (starts with '>'); ends with a line break, a carriage return, or both.
Following the definition line are the sequences: ONLY IUPAC characters (nucleotides or amino acids). A sequence ends with the next '>' or the end of the file.

Identifier formats in the definition line:
GenBank: gi|gi-number|gb|accession|locus
EMBL Data Library: gi|gi-number|emb|accession|locus
DDBJ, DNA Database of Japan: gi|gi-number|dbj|accession|locus
NBRF PIR: pir||entry
Protein Research Foundation: prf||name
SWISS-PROT: sp|accession|name
Brookhaven Protein Data Bank: pdb|entry|chain
Brookhaven Protein Data Bank: entry:chain|PDBID|CHAIN|SEQUENCE
Patents: pat|country|number
GenInfo Backbone Id: bbs|number
General database identifier: gnl|database|identifier
NCBI Reference Sequence: ref|accession|locus
Local Sequence identifier: lcl|identifier

Example:
>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
HFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYCK
MDWFLNYLNNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWLETISKK
TYAPPREGHLECTSTVTGMTVELNYIPKNRTNVTLSPQIESIWAAELDRYKLVEITPIGF
APTEVRRYTGGHERQKRVPFVXXXXXXXXXXXXXXXXXXXXXXVQSQHLLAGILQQQKNL
LAAVEAQQQMLKLTIWGVK
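The structure described above (a definition line starting with '>', followed by sequence lines until the next '>' or the end of the file) can be sketched as a small reader. This is an illustrative Python sketch, not the parser of any particular program; the function name and the two short test records are made up.

```python
def parse_fasta(text):
    """Return a list of (definition_line, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():      # handles \n, \r and \r\n line endings
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:      # the next '>' ends the previous record
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:              # end of file ends the last record
        records.append((header, "".join(chunks)))
    return records

example = ">seq1 first\nACGT\nACGT\n>seq2\nMKV\n"
records = parse_fasta(example)
print(records)
```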

5 FASTA Soft Rules Depend on the implementation of a specific program
No space directly after '>'. Should not contain more than 80 characters. May not contain whitespace (space, tabulator, other control characters). May only contain IUPAC nucleotides and amino acids. Note: often all foreign characters are simply ignored.
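A hedged sketch of how a program might check these soft rules. Two assumptions are made that the slide does not state: the 80-character limit is applied to every line, and the allowed alphabet is taken to be all uppercase letters plus '*' and '-', on the assumption that the IUPAC nucleotide and amino acid codes together cover all letters. Real validators differ in both points.

```python
import string

# Assumption: IUPAC nucleotide plus amino acid codes, with '*' and '-'.
ALLOWED = set(string.ascii_uppercase) | {"*", "-"}

def check_fasta_soft_rules(text, max_len=80):
    """Return a list of (line_number, message) warnings."""
    warnings = []
    for number, line in enumerate(text.splitlines(), start=1):
        if len(line) > max_len:
            warnings.append((number, f"line longer than {max_len} characters"))
        if line.startswith(">"):
            if line[1:2].isspace():
                warnings.append((number, "space directly after '>'"))
        else:
            bad = sorted(set(line.upper()) - ALLOWED)
            if bad:
                warnings.append((number, "foreign characters: " + repr(bad)))
    return warnings

warnings = check_fasta_soft_rules("> seq1\nACG T\n>seq2\nACGT\n")
print(warnings)
```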

6 eXtensible Markup Language
Aims to describe data Each item in an XML file can be clearly defined and understood Can be validated Is humanly readable Could be used to invent HTML Can be used instead of HTML Contains no formatting instructions (HTML, CSS, XSLT)

7 XML
A tag is a word enclosed in angle brackets '<>': <tag>
An element consists of an opening and a closing tag: <tag> opening tag, </tag> closing tag (note the slash '/')
An element can be abbreviated if it contains no further elements: <tag /> (opened and closed in one statement)
An element may contain any number of elements: <tag1><tag2></tag2></tag1>
An element may contain text information: <tag>Some important information about tags</tag>

8 XML
An element can contain attributes which describe the element further: <tag attribute="a green tag" />
The name of an element should describe its content: <Sequences></Sequences> should contain .. of course .. sequences.
Children of an element should be related to the parent: <Sequences> <Sequence></Sequence><Sequence></Sequence> </Sequences>
Attributes should extend the information logically:
<Sequences content="1">
  <Sequence id="gi|532319|pir|TVFV2E|TVFV2E envelope protein" length="45"> THEACTUALSEQUENCETHATISDESCRIBEDINTHISELEMENT </Sequence>
</Sequences>

9 XML Elements and Attributes can be used interchangeably
<Sequence>
  <identifier>gi|532319|pir|TVFV2E|TVFV2E envelope protein</identifier>
  <length>45</length>
  <seq type="english"> THEACTUALSEQUENCETHATISDESCRIBEDINTHISELEMENT </seq>
</Sequence>
Attributes and elements can be mixed.
If the elements are properly nested, the tags are appropriately opened, closed and matched, and the attributes are properly quoted, then the XML is well formed. It is not necessarily valid.
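Well-formedness can be checked by any XML parser. A minimal sketch, assuming Python's standard-library xml.etree.ElementTree, which rejects text that is not well formed but does not validate against a DTD or schema; the three test strings are made up.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """True if the text parses as XML (nesting, matching tags, quoted attributes)."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<Sequence><length>45</length></Sequence>"))  # properly nested
print(is_well_formed("<Sequence><length>45</Sequence>"))           # <length> never closed
print(is_well_formed("<tag attribute=green />"))                   # attribute not quoted
```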

10 XML Validation Data can be precisely described
Only elements defined may be used in an XML file if it is to be valid. Validation styles: Document Type Definitions (DTD) and XML Schema. Validation ensures that only expected data is in the document. Programs can then ensure that they understand the document.

11 An XML File
Declaration for the consumer of the XML file: <?xml version="1.0" encoding="UTF-8" ?>
A root element which will contain the complete data needs to be defined: <Sequences> <!-- Rest of the file --> </Sequences>
An XML file is like a tree consisting of:
Roots? No .. 1 root
Trunk (elements immediately following the root)
Branches (elements containing further elements [nested])
Leaves (elements with no contained elements)
Where are the attributes?

12 XML File Example
<?xml version="1.0" encoding="UTF-8" ?>
<Sequences>
  <Sequence>
    <identifier>gi|532319|pir|TVFV2E|TVFV2E envelope protein</identifier>
    <length>45</length>
    <seq type="nucleotides"> THEACTUALSEQUENCETHATISDESCRIBEDINTHISELEMENT </seq>
  </Sequence>
  <identifier>gi|732413|pir|PDP22E|PDP22E predicted protein</identifier>
  ANOTHERSEQUENCETHATMAYBERECORDEDSOMEWHERE
</Sequences>
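A short sketch of how a program could read such a file, assuming Python's standard-library xml.etree.ElementTree. Only the first, fully wrapped record of the example is used, and the XML declaration is left out to keep the string short; the function name is made up.

```python
import xml.etree.ElementTree as ET

XML_TEXT = """<Sequences>
  <Sequence>
    <identifier>gi|532319|pir|TVFV2E|TVFV2E envelope protein</identifier>
    <seq type="nucleotides">
      THEACTUALSEQUENCETHATISDESCRIBEDINTHISELEMENT
    </seq>
  </Sequence>
</Sequences>"""

def read_sequences(xml_text):
    """Collect identifier, type attribute and sequence text of each <Sequence>."""
    root = ET.fromstring(xml_text)
    records = []
    for element in root.findall("Sequence"):
        seq = element.find("seq")
        records.append({
            "identifier": element.findtext("identifier"),
            "type": seq.get("type"),
            "sequence": "".join(seq.text.split()),  # drop surrounding whitespace
        })
    return records

records = read_sequences(XML_TEXT)
print(records)
```

The length is not read from the file here; it can always be recomputed from the sequence itself.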

13 Styling of XML (4get HTML)
Cascading Stylesheets (CSS) define styles for the elements that shall be displayed:
Sequence { font-family: Courier, monospace; background-color: #DDDDDD; display: block; border-bottom: 1px outset green; }
identifier { font-family: Arial, Helvetica; font-size: 1.2em; font-weight: bold; }
length { display: none; }
Selectors are case sensitive for XML, so they must match the element names exactly.
Connecting XML and CSS: <?xml-stylesheet type="text/css" href="style.css" ?>
Add it at the top of the file, just after the version information.

14 Connecting to an XML Schema
The schema or definition is linked in the root element <Sequences xmlns="" xmlns:xsi="" xsi:schemaLocation=" > First attribute namespace Second attribute type of schema (here XML) Third attribute schema location (namespace + URL)

15 JavaScript Object Notation
JSON models objects. An object is enclosed in curly braces '{ }'. An array is enclosed in square brackets '[ ]'. Pairs have the form "Name" : "Value" and are separated by commas.
Example:
{"Sequence" : "ACGCTAGCCGCATCGTAGC"}
{"Sequence" : "ACGCTAGCCGCATCGTAGC", "Identifier" : "gi|532319|pir|TVFV2E|TVFV2E envelope protein", "length" : 19}
{"Sequences" : [
  {"Sequence" : "ACGCTAGCCGCATCGTAGC", "Identifier" : "gi|532319|pir|TVFV2E|TVFV2E envelope protein", "length" : 19},
  {"Sequence" : "ACGCTAACGCTGATCAGCGGCATCGCCGCATCGTAGC", "Identifier" : "gi|764319|pir|AVBB2E|AVBB2E predicted protein", "length" : 37}
]}
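A minimal sketch of reading such a JSON document, assuming Python's standard-library json module. The stored "length" field is left out on purpose and the length is derived from the sequence instead.

```python
import json

JSON_TEXT = """
{"Sequences": [
  {"Sequence": "ACGCTAGCCGCATCGTAGC",
   "Identifier": "gi|532319|pir|TVFV2E|TVFV2E envelope protein"},
  {"Sequence": "ACGCTAACGCTGATCAGCGGCATCGCCGCATCGTAGC",
   "Identifier": "gi|764319|pir|AVBB2E|AVBB2E predicted protein"}
]}
"""

data = json.loads(JSON_TEXT)  # objects become dicts, arrays become lists
lengths = {entry["Identifier"]: len(entry["Sequence"]) for entry in data["Sequences"]}
for identifier, length in lengths.items():
    print(identifier, length)
```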

16 XML - JSON
JSON is like XML because:
Both are 'self-describing', meaning that values are named, and thus 'human readable'.
Both are hierarchical (i.e. you can have values within values).
Both can be parsed and used by lots of programming languages.
Both can be passed around using AJAX (i.e. httpWebRequest).
JSON is UNlike XML because:
XML uses angle brackets, with a tag name at the start and end of an element; JSON uses curly braces with the name only at the beginning of the element.
JSON is less verbose, so it is quicker for humans to write and probably quicker to read.
JSON can be parsed trivially in JavaScript with JSON.parse(); the eval() procedure also works, but it executes whatever code the text contains.
JSON includes arrays (where each element doesn't have a name of its own).
In XML you can use any name that follows the XML naming rules for an element; in JSON names are quoted strings, so even reserved words from JavaScript are allowed.

17 ASN.1 Format

18 End of Theory I 5 min Mindmapping 10 min break

19 Practice I

20 File Formats FASTA Create a FASTA formatted file
Goto Choose FASTAValidator and check if your FASTA file is correct.

21 File Formats XML Write an XML file which will model a sequence similar to FASTA
<?xml version="1.0" ?>
<Sequences xmlns="" xmlns:xsi="" xsi:schemaLocation="
<Sequence Identifier="some id" Type="nucleotides||amino acids"> THE_SEQUENCE </Sequence>
</Sequences>

22 Validation Validate the instance of the XML file that you wrote
Copy and paste your data into one of those forms and see whether your XML is well formed and whether it is valid.

23 Style the File Use the provided CSS and XML file as a reference
Change the styling such that it is pleasant for you to view

24 U Design Write an XML file which models a person
No validation possible unless you can create the schema as well (not part of this course) Add information about yourself in the file Style the file using CSS Example from Present your creation

25 Theory II

26 Sequence Alignment
Exact (simple) vs. approximate (more difficult)
[Figure: a pattern aligned against a target, once for each case]

27 Sequence Alignment Exact pattern matching
Naive method aligns pattern with each location of the target Boyer-Moore indexes the pattern to skip some alignments Wu-Manber indexes many patterns and skips some alignments Indexing Suffix tree indexes target and then quickly finds each pattern Many other methods
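The naive method can be sketched in a few lines of Python: the pattern is aligned with every location of the target and compared there. Function and variable names are made up for illustration.

```python
def naive_search(pattern, target):
    """Return every start position at which pattern occurs exactly in target."""
    if not pattern:
        return []
    hits = []
    for start in range(len(target) - len(pattern) + 1):
        # Align the pattern with this location and compare.
        if target[start:start + len(pattern)] == pattern:
            hits.append(start)
    return hits

hits = naive_search("ACG", "TACGACGT")
print(hits)
```

Boyer-Moore and the index-based methods find the same positions; they only avoid trying every alignment.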

28 Sequence Alignment Approximate pattern matching
Pairwise, local: Smith-Waterman, BLAST, FASTA
Pairwise, global: Needleman-Wunsch
Multiple: T-Coffee, ClustalW, ...
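To make the idea of a global alignment concrete, here is a sketch of the Needleman-Wunsch score computation in Python. It returns only the optimal score, not the alignment itself, and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of a and b, keeping one table row at a time."""
    previous = [j * gap for j in range(len(b) + 1)]   # empty prefix of a vs b[:j]
    for i in range(1, len(a) + 1):
        current = [i * gap]                           # a[:i] vs empty prefix of b
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            current.append(max(diagonal,              # align a[i-1] with b[j-1]
                               previous[j] + gap,     # gap in b
                               current[j - 1] + gap)) # gap in a
        previous = current
    return previous[-1]

print(needleman_wunsch_score("ACGT", "ACGT"))
print(needleman_wunsch_score("ACGT", "AGT"))
```

Smith-Waterman uses the same recurrence but clips every cell at zero and takes the table maximum, which is what makes it local.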

29 Basic Local Alignment Search Tool
Input Pattern Target Search parameters and settings Output Alignments in various formats XML Help

30 BLAST Target
Needs to be indexed; cannot be plain FASTA.
Must fit the pattern and the BLAST variant: a protein target and a protein pattern can be searched using blastp.
Target indexing: makeblastdb, in the BLAST package, can index FASTA files. It needs sequence input (e.g. FASTA, ASN.1) and the sequence type (e.g. protein).

31 BLAST blastp Needs indexed database
Needs query sequence (can be unindexed FASTA) Produces alignments

32 Poll class - How many of you have done a BLAST search before?
We'll all get to do BLAST searches by the end of today. How do you do a BLAST search? Determining factors for how effectively you reach your goal of identifying homologs in the results: Scoring model (implicit in the BLAST program; distinguishes true from false positives). Database (quantity and quality of information). Method (efficient and comprehensive).

33 Blast flavors Query: DNA Protein DB: DNA Protein
BlastN - nt versus nt database BlastP - protein versus protein database BlastX - translated nt (6 frames) versus protein database tBlastN - protein versus translated nt database (6 frames) tBlastX - translated nt versus translated nt database (both 6 frames)

34 BLAST Output XML -outfmt 5 This switch leads to XML output

35 Download Blast Get blastp and makeblastdb from mbg404 since you are not allowed to install anything Download a Fasta file (protein, genome, collection of sequences in fasta format) Database must consist of amino acids since we only have access to blastp today Use makeblastdb from the Blast package to index the file Several files will be created when you do it right

36 MakeDB Example
makeblastdb -in seq.fasta -dbtype prot -out seqBl -title seqBlastDB
More information? Go to the doc folder of BLAST; the documentation is there.

37 BLAST Now that we have an indexed database try to run BLAST
Read documentation and try to solve the simplest case You will need the indexed database and you will need a FASTA file as query You could create queries from the database and slightly change them Good luck
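One possible way to script the two steps, sketched in Python. The switches used (-in, -dbtype, -out, -title for makeblastdb; -query, -db, -outfmt, -out for blastp) are BLAST+ command line options; the file names query.fasta and result.xml are made up, and the sketch only builds and prints the commands so it runs without BLAST installed.

```python
import subprocess

def makeblastdb_command(fasta_path, db_name, title):
    """Command that indexes a protein FASTA file."""
    return ["makeblastdb", "-in", fasta_path, "-dbtype", "prot",
            "-out", db_name, "-title", title]

def blastp_command(query_path, db_name, out_path):
    """Command that searches the indexed database; -outfmt 5 requests XML output."""
    return ["blastp", "-query", query_path, "-db", db_name,
            "-outfmt", "5", "-out", out_path]

def run(command):
    """Execute a command; requires the BLAST programs to be on the PATH."""
    return subprocess.run(command, check=True)

index_cmd = makeblastdb_command("seq.fasta", "seqBl", "seqBlastDB")
search_cmd = blastp_command("query.fasta", "seqBl", "result.xml")
print(" ".join(index_cmd))
print(" ".join(search_cmd))
# run(index_cmd); run(search_cmd)  # uncomment once the files and programs exist
```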

38 Theory II

39 Mass Spectra Recording (e.g. Triple Play)

40 Fragmentation Spectrum

41 MS/MS spectra MS/MS spectra can be assigned a peptide sequence (PSM)
Database search De novo sequencing

42 PepNovo
Performs de novo sequencing of MS/MS spectra. Takes a single spectrum as input. Needs a mathematical model for its evaluation. Displays the results in the console, so you will need to redirect the output.
Example: ?>PepNovo.exe -dta MSMSSpectrum.dta -model tryp_model.txt

43 De Novo Sequencing
[Figure: fragmentation spectrum with peaks at 1016.4, 901.6 and 772.4]
Peak differences: 1016.4 - 901.6 = 114.8; 901.6 - 772.4 = 129.2 ~ E (129.1)
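The principle shown on this slide, reading amino acids from the mass differences between neighbouring peaks, can be sketched as follows. The table holds monoisotopic residue masses; the tolerance of 0.3 is an assumption for illustration, and PepNovo itself uses a far more elaborate scoring model. The slide labels only the 129.2 difference; whatever this sketch reports for 114.8 is a consequence of the assumed tolerance.

```python
# Monoisotopic residue masses in Dalton (leucine and isoleucine are identical).
RESIDUE_MASSES = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L/I": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}

def candidate_residues(mass_difference, tolerance=0.3):
    """Residues whose mass lies within the tolerance of the peak difference."""
    return sorted(residue for residue, mass in RESIDUE_MASSES.items()
                  if abs(mass - mass_difference) <= tolerance)

peaks = [1016.4, 901.6, 772.4]
differences = [round(high - low, 1) for high, low in zip(peaks, peaks[1:])]
calls = [candidate_residues(difference) for difference in differences]
print(differences, calls)
```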

44 MS/MS spectra MS/MS spectra can be assigned a peptide sequence (PSM)
Database search De novo sequencing

45 Correlation Database selection >1080ZR IAAYPGVSPGLMIHYNIGR

46 Initialization Files X!Tandem
Taxonomy.xml, Default_Input.xml, Input.xml
Running X!Tandem: ?>tandem.exe input.xml
That was easy. But behold, what about the input?

47 Taxonomy XML
<?xml version="1.0" ?>
<bioml label="x! taxon-to-file matching list">
  <taxon label="chlamy">
    <file format="peptide" URL="" />
  </taxon>
</bioml>

48 Input.xml
<?xml version="1.0" ?>
<bioml>
  <note>Each one of the parameters for x! tandem is entered as a labeled note node. Any of the entries in the default_input.xml file can be over-ridden by adding a corresponding entry to this file. This file represents a minimum input file, with only entries for the default settings, the output file and the input spectra file name. See the taxonomy.xml file for a description of how FASTA sequence list files are linked to a taxon name.</note>
  <note type="input" label="list path, default parameters">default_input.xml</note>
  <note type="input" label="list path, taxonomy information">taxonomy.xml</note>
  <note type="input" label="protein, taxon">chlamy</note>
  <note type="input" label="spectrum, path">test_spectra.mgf</note>
  <note type="input" label="output, path">output.xml</note>
</bioml>
Another input file. Personally, I don't approve of the XML used here.
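Because every parameter is a note element with a type and a label attribute, the settings can be pulled out with a few lines of code. A sketch assuming Python's standard-library xml.etree.ElementTree; the free-text note is shortened here and the function name is made up.

```python
import xml.etree.ElementTree as ET

INPUT_XML = """<bioml>
  <note>Free-text notes without a type attribute are documentation only.</note>
  <note type="input" label="list path, default parameters">default_input.xml</note>
  <note type="input" label="list path, taxonomy information">taxonomy.xml</note>
  <note type="input" label="protein, taxon">chlamy</note>
  <note type="input" label="spectrum, path">test_spectra.mgf</note>
  <note type="input" label="output, path">output.xml</note>
</bioml>"""

def read_parameters(xml_text):
    """Map the label of every <note type="input"> to its text value."""
    root = ET.fromstring(xml_text)
    return {note.get("label"): (note.text or "").strip()
            for note in root.findall("note")
            if note.get("type") == "input"}

parameters = read_parameters(INPUT_XML)
for label, value in parameters.items():
    print(label, "=", value)
```

The example also shows why the design is awkward: the actual parameter name lives inside an attribute value, so a schema cannot constrain which parameters exist.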

49 Default-input XML
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="tandem-input-style.xsl"?>
<bioml>
  <note>list path parameters</note>
  <note type="input" label="list path, default parameters">default_input.xml</note>
  <note>This value is ignored when it is present in the default parameter list path.</note>
  <note type="input" label="list path, taxonomy information">taxonomy.xml</note>
  <note>spectrum parameters</note>
  <note type="input" label="spectrum, fragment monoisotopic mass error">0.4</note>
  <note type="input" label="spectrum, parent monoisotopic mass error plus">100</note>
  <note type="input" label="spectrum, parent monoisotopic mass error minus">100</note>
  <note type="input" label="spectrum, parent monoisotopic mass isotope error">yes</note>
  <note type="input" label="spectrum, fragment monoisotopic mass error units">Daltons</note>
  <note>The value for this parameter may be 'Daltons' or 'ppm': all other values are ignored</note>
  <note type="input" label="spectrum, parent monoisotopic mass error units">ppm</note>
  <note type="input" label="spectrum, fragment mass type">monoisotopic</note>
  <note>values are monoisotopic|average </note>
  <note>spectrum conditioning parameters</note>
  <note type="input" label="spectrum, dynamic range">100.0</note>
  <note>The peaks read in are normalized so that the most intense peak is set to the dynamic range value. All peaks with values of less that 1, using this normalization, are not used. This normalization has the overall effect of setting a threshold value for peak intensities.</note>
  <note type="input" label="spectrum, total peaks">50</note>
  <note>If this value is 0, it is ignored. If it is greater than zero (lets say 50), then the number of peaks in the spectrum with be limited to the 50 most intense peaks in the spectrum. X! tandem does not do any peak finding: it only limits the peaks used by this parameter, and the dynamic range parameter.</note>
  <note type="input" label="spectrum, maximum parent charge">4</note>
  <note type="input" label="spectrum, use noise suppression">yes</note>
  <note type="input" label="spectrum, minimum parent m+h">500.0</note>
  <note type="input" label="spectrum, minimum fragment mz">150.0</note>
  <note type="input" label="spectrum, minimum peaks">15</note>
  <note type="input" label="spectrum, threads">1</note>
  <note type="input" label="spectrum, sequence batch size">1000</note>
  <note>residue modification parameters</note>
</bioml>

50 Beautifying XML
XML: only describes data.
Formatting of XML: additional files can be linked to beautify the display.
Transformation (XSLT): translates XML into HTML.
XML styling (CSS): describes the formatting of the elements and attributes used in the XML file.
Both files need to be linked at the beginning of the XML file.

51 XML What is an element? What is an attribute? Design a Person
What are attributes of a person? Use elements for logical grouping Use attributes for specific information

52 Styling Connect the example style Nothing will be styled ;)
Examine the CSS file and rename the styles such that your person XML will be somewhat styled

53 End Theory II 5 min mindmapping 10 min break

54 Practice II

55 View Spectra and Sequence
To view matching peaks of the PepNovo prediction and the spectrum at the same time Use the DtaViewer from

56 Download
Download PepNovo
Download test file

57 Try PepNovo
Try to run PepNovo. Use the given input, the help information, the lecture slides, and the lecture notes.
Aim: store the result in a text file.

58 PepNovo Results are displayed in the console
We need to redirect the output into a file. ?>PepNovo.exe -dta MSMSSpectrum.dta -model tryp_model.txt > result.txt

59 X!Tandem
Unzip the folder and check for:
MGF-formatted spectra (file)
Database file (FASTA)
tandem-win folder
Used .xml configuration files (default_input.xml, input.xml and taxonomy.xml)
To get the same output as given in the zip folder: replace the configuration files in the «tandem-win\bin» folder with the ones in the «used» folder. Also copy the database file to the «fasta» folder and the .mgf file to «bin» in «tandem-win».

60 X!Tandem Console Application

61 X!Tandem Default Input Parameters such as mass tolerances, enzyme type, and number of charges for the search can be changed in default_input.xml

62 X!Tandem Input.xml In the input.xml file, you should specify the paths of:
taxonomy.xml, default_input.xml, the spectra file, and the output file. NOTE: Here input.xml and all files above are in the same folder (directory).

63 X!Tandem Taxonomy In the taxonomy file, you should specify the «database file path». In this example, the database file is in the «fasta» folder in the «Xtandem\tandem-win» folder.

64 X!Tandem Output
