Bioinformatics Tigor Nauli / Research Center for Informatics - LIPI.


Topics
– Definition
– Biological database
– Sequence alignment
– Gene prediction
– Phylogenetic analysis
– Protein structure prediction
– Other studies
– Conclusion

Definition
Bioinformatics is
– the application of computational tools and techniques to the management and analysis of biological data
– information technology (IT) in molecular biology
– a subset of the larger field of computational biology
The term "bioinformatics" is used in a number of ways, depending on who is using it.

Definition
The field of bioinformatics relies heavily on work by experts in statistical methods and pattern recognition.
Bioinformatics is in silico research: the work is carried out on a computer rather than at the laboratory bench.

Biological database
Database
– an archive of information
– a logical organization of that information
– tools to gain access to it
Biological data
– cover nucleic acid and protein sequences, macromolecular structures, and function
– are generated by efficient large-scale sequencing machines
– are submitted by molecular biologists around the world

Biological database
Archival databases of biological information hold
– nucleic acid and protein sequences
– protein expression patterns
– sequence motifs (signature patterns)
– mutations and variants in sequences
– classifications or relationships of protein sequence families or protein folding patterns
– bibliographic information

Biological database
Databanks for nucleotide sequences
– GenBank, maintained by the National Center for Biotechnology Information (NCBI)
– EMBL (European Molecular Biology Laboratory)

Biological database
Databank for annotated protein sequences
– SWISS-PROT, maintained by the Swiss Institute of Bioinformatics (SIB) together with the European Bioinformatics Institute (EBI)
Databank for sequence profiles, patterns, and motifs
– PROSITE
Databank for protein structures
– Protein Data Bank

Biological database
Database queries
– given a sequence, or a fragment of a sequence, find sequences in the database that are similar to it
– given a protein structure, or a fragment of one, find protein structures in the database that are similar to it
Such searches are carried out thousands of times a day.

Biological database
Database queries
– given the sequence of a protein of unknown structure, find structures in the database that adopt similar three-dimensional structures
– given a protein structure, find sequences in the databank that correspond to similar structures
These two problems are active fields of research.

Biological database
GenBank file
– may contain identifying, descriptive, and genetic information in ASCII format
– for example:

LOCUS       AF bp mRNA linear INV 03-JAN-2000
DEFINITION  Drosophila melanogaster transcription factor Toy (toy) mRNA,
            complete cds.
ACCESSION   AF
VERSION     AF GI:
KEYWORDS    .
SOURCE      Drosophila melanogaster (fruit fly)
  ORGANISM  Drosophila melanogaster
            Eukaryota; Metazoa; Arthropoda; Hexapoda; Insecta; Pterygota;
            Neoptera; Endopterygota; Diptera; Brachycera; Muscomorpha;
            Ephydroidea; Drosophilidae; Drosophila.
REFERENCE   1 (bases 1 to 1734)
  AUTHORS   Czerny,T., Halder,G., Kloter,U., Souabni,A., Gehring,W.J. and
            Busslinger,M.
  TITLE     twin of eyeless, a second Pax-6 gene of Drosophila, acts
            upstream of eyeless in the control of eye development
  JOURNAL   Mol. Cell 3 (3), (1999)
  MEDLINE
  PUBMED
REFERENCE   2 (bases 1 to 1734)
  AUTHORS   Czerny,T., Halder,G., Kloter,U., Souabni,A., Gehring,W.J. and
            Busslinger,M.
  TITLE     Direct Submission
  JOURNAL   Submitted (11-MAR-1999) Research Institute of Molecular
            Pathology, Dr. Bohr-Gasse 7, Vienna A-1030, Austria

FEATURES             Location/Qualifiers
     source
                     /organism="Drosophila melanogaster"
                     /mol_type="mRNA"
                     /db_xref="taxon:7227"
                     /chromosome="IV"
                     /map="102E1"
                     /dev_stage="embryo"
     gene
                     /gene="toy"
                     /note="twin of eyeless; second Pax-6"
     CDS
                     /gene="toy"
                     /codon_start=1
                     /product="transcription factor Toy"
                     /protein_id="AAD "
                     /db_xref="GI: "
                     /translation="MMLTTEHIMHGHPHSSVGQSTLFGCSTAGHSGINQLGGVYVNGR
                     PLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETGSIKPRAIGGSKPR
                     VATTPVVQKIADYKRECPSIFAWEIRDRLLSEQVCNSDNIPSVSSINRVLRNLASQKE
                     QQAQQQNESVYEKLRMFNGQTGGWAWYPSNTTTAHLTLPPAASVVTSPANLSGQADRD
                     DVQKRELQFSVEVSHTNSHDSTSDGNSEHNSSGDEDSQMRLRLKRKLQRNRTSFSNEQ
                     IDSLEKEFERTHYPDVFARERLADKIGLPEARIQVWFSNRRAKWRREEKMRTQRRSAD
                     TVDGSGRTSTANNPSGTTASSSVATSNNSTPGIVNSAINVAERTSSALISNSLPEASN
                     GPTVLGGEANTTHTSSESPPLQPSAPRLPLNSGFNTMYSSIPQPIATMAENYNSSLGS
                     MTPSCLQQRDAYPYMFHDPLSLGSPYVSAHHRNTACNPSAAHQQPPQHGVYTNSSPMP
                     SSNTGVISAGVSVPVQISTQNVSDLTGSNYWPRLQ"
     misc_difference 1605
                     /gene="toy"
                     /note="compared to genomic sequence; aspartic acid to
                     glutamic acid change"
                     /replace="a"

ORIGIN
        1 taattaatta tgatgctaac aactgaacac ataatgcatg ggcatcccca ctcgtcagtc
       61 gggcagagta ctctatttgg gtgctccacg gcgggccata gcggaataaa tcagctgggc
      121 ggcgtatatg ttaatggccg gccactgccc gattcaacgc gtcaaaaaat tgtcgaattg
      181 gctcattccg gcgcacgtcc ttgtgatatt tcaagaatac tacaagtgtc caacggttgc
      241 gtaagcaaaa ttttgggcag atattatgaa actggatcga taaaacctcg agctataggt
      301 ggttcaaagc cacgagtagc tacaaccccg gttgtgcaaa aaattgcaga ttacaaacgg
      361 gaatgtccca gcatatttgc gtgggaaata cgagatcgac tgctatcgga acaagtttgc
      421 aatagtgata acattccaag tgtttcatct attaatcgag tcttacgtaa cctggcctca
      481 caaaaggagc agcaagctca gcaacaaaac gaatccgttt atgaaaagct tcgcatgttt
      541 aatggccaaa cgggcggatg ggcatggtat ccaagcaata caacgacggc acatttgacg
      601 ctaccaccag cagcttccgt tgtgacatct cctgcaaatt tatcaggaca ggccgatcgg
      661 gatgatgttc aaaaaagaga attacaattt tcagtagaag tttcgcatac aaactctcac
      721 gatagtacat cggatggaaa ctctgaacat aattcatccg gggacgaaga ctctcaaatg
      781 cggttgcgcc taaaaaggaa gttacagcgc aatcggacat cattttctaa tgagcaaatt
      841 gacagtcttg aaaaagaatt tgaaagaaca cattatcccg atgtttttgc gcgagaaagg
      901 cttgctgata aaattggttt gccagaggca cgtattcagg tttggttttc aaaccgacga
      961 gctaaatggc gccgagaaga aaaaatgcga actcagagac gatcggccga taccgtggac
     1021 ggcagtggtc gaaccagcac ggcaaataat ccttcaggaa cgactgcatc ttcctccgtc
     1081 gcaacgtcaa acaactcaac tccagggatt gtgaactcag caatcaacgt tgcggaacga
     1141 acatcatctg cattaattag taatagcctt cccgaggctt caaatggacc aactgttttg
     1201 ggtggtgaag ctaatactac acacaccagc tctgaaagcc caccccttca gccatcggca
     1261 ccgcggctac ccttaaattc tggattcaac accatgtact catctattcc acaaccgatt
     1321 gcaacgatgg ctgaaaatta caactcctca ttaggatcaa tgaccccgtc atgcttacaa
     1381 caacgcgatg cctatcctta catgtttcac gatccgttat cactaggatc tccctatgtg
     1441 tcagcccacc atcgaaacac agcttgcaac ccctcagctg cgcaccaaca gccccctcag
     1501 catggcgttt ataccaatag ttctccaatg ccatcatcaa acacaggtgt catttctgcg
     1561 ggcgtttcgg tgcctgtcca gatttcaacg caaaatgtat ctgacctaac gggaagcaat
     1621 tactggccac gtcttcagtg atcgtcaatc tttggctcac cattagatca tttgtcaaag
     1681 gcgactgccg ctgcaatcat tgccgcacaa gcagctgaga aaagccataa acac
//
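
A record like the one above can be read with very little code. The following is a minimal sketch, not a full parser: it assumes the usual flat-file layout (keywords start in the first column, continuation lines are indented, the sequence follows ORIGIN and the record ends with //). The function name read_genbank and the shortened sample record with the placeholder locus name EXAMPLE are illustrative only; production work would normally rely on an established library instead.

```python
def read_genbank(text):
    """Return (definition, sequence) from one GenBank flat-file record."""
    definition_parts = []
    sequence_parts = []
    section = None
    for line in text.splitlines():
        if line.startswith("//"):            # end-of-record marker
            break
        if section == "ORIGIN":
            # Sequence lines hold a position number followed by blocks of bases.
            sequence_parts.extend(tok for tok in line.split() if tok.isalpha())
            continue
        if line[:1].strip():                 # a keyword starts in column 1
            keyword, _, rest = line.partition(" ")
            section = keyword
            if keyword == "DEFINITION":
                definition_parts.append(rest.strip())
        elif section == "DEFINITION":        # indented continuation line
            definition_parts.append(line.strip())
    return " ".join(definition_parts), "".join(sequence_parts)


# Shortened, hypothetical record built from the first bases of the example.
record = """\
LOCUS       EXAMPLE
DEFINITION  Drosophila melanogaster transcription factor Toy (toy) mRNA,
            complete cds.
ORIGIN
        1 taattaatta tgatgctaac
       21 aactgaacac
//
"""
definition, sequence = read_genbank(record)
print(definition)
print(len(sequence), sequence[:10])
```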

Sequence alignment
The basic sequence analysis task is to ask whether two sequences are related
– sequence similarity/homology
When we compare sequences, we assume that they have diverged by a process of mutation.
The mutational processes are substitutions, which change residues in a sequence, and insertions and deletions, which add or remove residues.
There are three ways an alignment can be extended: match, mismatch, and gap.

Sequence alignment
We use a dynamic programming matrix to find the optimal alignment.
To align CATGT with ACGCTG, we first fill the matrix using these scores:
+2 for a match
-1 for a mismatch
-1 for a gap
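
The matrix fill can be sketched in a few lines. The version below assumes a global (end-to-end, Needleman-Wunsch style) alignment with a linear gap penalty; the slides do not say whether a global or a local alignment was intended, so treat that choice, and the function name global_align, as assumptions.

```python
def global_align(a, b, match=2, mismatch=-1, gap=-1):
    """Global alignment by dynamic programming; returns (score, row_a, row_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):                 # first column: a aligned to gaps
        score[i][0] = i * gap
    for j in range(1, cols):                 # first row: b aligned to gaps
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,   # match or mismatch
                              score[i - 1][j] + gap,        # gap in b
                              score[i][j - 1] + gap)        # gap in a
    # Trace back from the bottom-right cell to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            pair = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + pair:
                row_a.append(a[i - 1])
                row_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(row_a)), "".join(reversed(row_b))


best, top, bottom = global_align("CATGT", "ACGCTG")
print(best)
print(top)
print(bottom)
```

Several alignments can share the best score; the traceback returns one of them, preferring the diagonal move when there is a tie.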

Sequence alignment
This slide showed the maximum score, the best alignment, and example output from the BLAST program; the figures are not preserved in the transcript.

Gene prediction
Predicting gene locations
– identify all the open reading frames (ORFs) in unannotated DNA
– compare a query sequence with an entire annotated DNA database to find similar sequences
– use Bayesian statistics to find the subsequence most likely to follow a known subsequence, for example:
P(CCGAT) = P(CC) * P(G|CC) * P(A|CG) * P(T|GA)
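
The factorization above conditions each base on the two bases before it, which is a second-order Markov chain. The sketch below estimates such a model from training sequences and then multiplies the factors exactly as in the formula. The training sequences, the pseudocount smoothing, and the function names are illustrative assumptions, not part of the slides.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"


def train_second_order(training_seqs, pseudocount=1.0):
    """Estimate P(first two bases) and P(base | previous two bases)."""
    pair_counts = Counter()
    triple_counts = Counter()
    for seq in training_seqs:
        if len(seq) >= 2:
            pair_counts[seq[:2]] += 1            # starting dinucleotide
        for k in range(len(seq) - 2):
            triple_counts[seq[k:k + 3]] += 1     # context (2 bases) + next base
    total_pairs = sum(pair_counts.values()) + 16 * pseudocount
    initial = {x + y: (pair_counts[x + y] + pseudocount) / total_pairs
               for x, y in product(BASES, repeat=2)}
    transition = {}
    for x, y in product(BASES, repeat=2):
        context = x + y
        total = sum(triple_counts[context + z] for z in BASES) + 4 * pseudocount
        for z in BASES:
            transition[(context, z)] = (triple_counts[context + z] + pseudocount) / total
    return initial, transition


def sequence_probability(seq, initial, transition):
    """P(seq) = P(first two bases) * product of P(base | two preceding bases)."""
    p = initial[seq[:2]]
    for k in range(2, len(seq)):
        p *= transition[(seq[k - 2:k], seq[k])]
    return p


# Hypothetical training data, for illustration only.
initial, transition = train_second_order(["CCGATCCGAT", "CCGTTACGAT"])
print(sequence_probability("CCGAT", initial, transition))
```

For long sequences the product underflows quickly, so a practical version would sum logarithms instead of multiplying probabilities.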

Gene prediction
– implement a hidden Markov model (HMM)
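
One common way to use a hidden Markov model for this task is to label each base with its most probable hidden state using the Viterbi algorithm. The sketch below is a toy illustration under stated assumptions: the two states ("coding" and "noncoding") and every probability in it are invented for the example and do not come from the slides or from any real gene finder.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most probable state path for a non-empty observation sequence (log space)."""
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}]
    back = [{}]
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            # Best predecessor state for ending in state s at this position.
            prev = max(states, key=lambda r: best[-1][r] + math.log(trans_p[r][s]))
            scores[s] = (best[-1][prev] + math.log(trans_p[prev][s])
                         + math.log(emit_p[s][obs]))
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical parameters: "coding" regions are assumed to be GC-rich.
STATES = ("coding", "noncoding")
START = {"coding": 0.5, "noncoding": 0.5}
TRANS = {"coding": {"coding": 0.9, "noncoding": 0.1},
         "noncoding": {"coding": 0.1, "noncoding": 0.9}}
EMIT = {"coding": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
        "noncoding": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}}

print(viterbi("ATATATATGCGCGCGC", STATES, START, TRANS, EMIT))
```

Real gene-finding models use many more states (start and stop codons, exons, introns) and parameters trained on annotated genomes.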

Phylogenetic analysis
– the process of developing hypotheses about the evolutionary relatedness of organisms based on their observable characteristics
Phylogenetic tree
– built from a multiple sequence alignment

Phylogenetic analysis
– implements the parsimony method, UPGMA, cladistics, neighbor joining, the least squares method, maximum likelihood, or clustering to determine the differences between the sequences
– finds the relatedness by clustering
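
Of the methods listed, UPGMA is the simplest to sketch: repeatedly merge the two closest clusters and average their distances to the remaining clusters, weighted by cluster size. The code below is a minimal illustration that returns only the tree topology as nested tuples (no branch lengths); the taxon names and the distance matrix are made up for the example.

```python
def upgma(names, matrix):
    """Build a UPGMA tree; matrix[i][j] is the distance between names[i] and names[j]."""
    size = {name: 1 for name in names}       # number of leaves in each cluster
    dist = {(a, b): matrix[i][j]
            for i, a in enumerate(names)
            for j, b in enumerate(names) if i != j}
    active = list(names)
    while len(active) > 1:
        # Pick the closest pair of active clusters.
        a, b = min(((x, y) for k, x in enumerate(active) for y in active[k + 1:]),
                   key=lambda pair: dist[pair])
        merged = (a, b)
        size[merged] = size[a] + size[b]
        active = [c for c in active if c not in (a, b)]
        for c in active:
            # Size-weighted average distance from the new cluster to cluster c.
            d = (size[a] * dist[(a, c)] + size[b] * dist[(b, c)]) / size[merged]
            dist[(merged, c)] = dist[(c, merged)] = d
        active.append(merged)
    return active[0]


# Hypothetical pairwise distances between four sequences.
taxa = ["A", "B", "C", "D"]
distances = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 10],
    [10, 10, 10, 0],
]
tree = upgma(taxa, distances)
print(tree)
```

UPGMA assumes a constant rate of change along all branches; neighbor joining drops that assumption at the cost of a slightly more involved update step.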

Phylogenetic analysis
– the percentage of identity (table not preserved in the transcript)
– phylogenetic tree (figure not preserved in the transcript)
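
The percentage of identity between two aligned sequences can be computed directly from the alignment columns. Definitions differ in what they use as the denominator; the sketch below assumes that columns containing a gap are ignored, and the function name is illustrative.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of identical residues over the gap-free alignment columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":     # skip columns that contain a gap
            continue
        compared += 1
        if x == y:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


print(percent_identity("-CA-TGT", "ACGCTG-"))
```

Subtracting the value from 100 gives a simple distance that can feed a clustering method such as UPGMA.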

Protein structure prediction
Two approaches to computational modeling of protein structure
– knowledge-based modeling: employs parameters extracted from the database of existing structures to evaluate and optimize structures
– prediction of structure from sequence

Protein structure prediction
Predicting from sequence:
– select the protein sequence of the target
– estimate the secondary structure by calculating its hydrophobicity values
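
Hydrophobicity values are usually computed as a sliding-window average over the sequence. The slides do not name a scale, so the sketch below assumes the widely used Kyte-Doolittle hydropathy scale and a window of 9 residues; both are assumptions, as is the function name. The test sequence is the start of the Toy protein translation from the GenBank example.

```python
# Kyte-Doolittle hydropathy values (assumed scale; the slides do not specify one).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein, window=9):
    """Average hydropathy in each window of consecutive residues."""
    values = [KYTE_DOOLITTLE[res] for res in protein.upper()]
    if window < 1 or window > len(values):
        return []
    return [sum(values[k:k + window]) / window
            for k in range(len(values) - window + 1)]


toy_start = "MMLTTEHIMHGHPHSSVGQSTLFGCSTAGHSG"
profile = hydropathy_profile(toy_start)
for position, value in enumerate(profile, start=1):
    print(position, round(value, 2))
```

A hydropathy profile only hints at structure, for example by flagging strongly hydrophobic stretches that may be buried or membrane-spanning; it is not a full secondary-structure prediction on its own.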

Protein structure prediction
– align the structure with similar sequences in the databank
– find the list of angles
– draw the structure

Other studies
– Protein structure property analysis
– Biochemical simulation
– Whole genome analysis
– Primer design
– DNA microarray analysis
– Proteomics analysis

Conclusion
Bioinformatics can provide anything from the abstraction of the properties of a biological system into a mathematical or physical model to the implementation of new algorithms for data analysis.