Bioinformatics Tigor Nauli / Research Center for Informatics - LIPI
Topics Definition Biological database Sequence alignment Gene prediction Phylogenetic analysis Protein structure prediction Other studies Conclusion
Definition Bioinformatics is –the application of computational tools and techniques to the management and analysis of biological data. –information technology (IT) in molecular biology. –a subset of the larger field of computational biology. The term bioinformatics is used in a number of ways, depending on who is using it.
Definition The field of bioinformatics relies heavily on work by experts in statistical methods and pattern recognition. Bioinformatics is in silico research.
Biological database Database –an archive of information. –a logical organization of information. –tools to gain access to it. Biological data –cover nucleic acid and protein sequences, macromolecular structures, and function –are generated by efficient large-scale sequencing machines. –are submitted by molecular biologists around the world.
Biological database Archival databases of biological information –nucleic acid and protein sequences –protein expression patterns –sequence motifs (signature patterns) –mutations and variants in sequences –classification or relationships of protein sequence families or protein folding patterns –bibliographic information
Biological database Databanks for nucleotide sequences –GenBank is maintained by the National Center for Biotechnology Information (NCBI) –EMBL (European Molecular Biology Laboratory)
Biological database Databank for annotated protein sequences –SWISS-PROT is maintained by the Swiss Institute of Bioinformatics (SIB) together with the European Bioinformatics Institute (EBI) Databank for sequence profiles, patterns, and motifs –PROSITE Databank for protein structures –Protein Data Bank
Biological database Database queries –given a sequence, or fragment of a sequence, find sequences in the database that are similar to it –given a protein structure, or fragment, find protein structures in the database that are similar to it Searches of these two kinds are carried out thousands of times a day.
Biological database Database queries –given the sequence of a protein of unknown structure, find structures in the database that adopt similar three-dimensional structures –given a protein structure, find sequences in the databank that correspond to similar structures Queries of these two kinds are active fields of research.
Biological database GenBank file –may contain identifying, descriptive, and genetic information in ASCII format –for example:
LOCUS       AF bp mRNA linear INV 03-JAN-2000
DEFINITION  Drosophila melanogaster transcription factor Toy (toy) mRNA,
            complete cds.
ACCESSION   AF
VERSION     AF GI:
KEYWORDS    .
SOURCE      Drosophila melanogaster (fruit fly)
  ORGANISM  Drosophila melanogaster
            Eukaryota; Metazoa; Arthropoda; Hexapoda; Insecta; Pterygota;
            Neoptera; Endopterygota; Diptera; Brachycera; Muscomorpha;
            Ephydroidea; Drosophilidae; Drosophila.
REFERENCE   1 (bases 1 to 1734)
  AUTHORS   Czerny,T., Halder,G., Kloter,U., Souabni,A., Gehring,W.J. and
            Busslinger,M.
  TITLE     twin of eyeless, a second Pax-6 gene of Drosophila, acts
            upstream of eyeless in the control of eye development
  JOURNAL   Mol. Cell 3 (3), (1999)
  MEDLINE
   PUBMED
REFERENCE   2 (bases 1 to 1734)
  AUTHORS   Czerny,T., Halder,G., Kloter,U., Souabni,A., Gehring,W.J. and
            Busslinger,M.
  TITLE     Direct Submission
  JOURNAL   Submitted (11-MAR-1999) Research Institute of Molecular
            Pathology, Dr. Bohr-Gasse 7, Vienna A-1030, Austria
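A GenBank record like the one above is plain text, so its header can be read with a few lines of code. The following is a minimal sketch only: it assumes the usual flat-file layout in which the keyword occupies the first 12 columns and continuation lines leave those columns blank. The function name `parse_genbank_header` and the `key_width` parameter are illustrative, not part of any library, and the sample text reuses a few lines of the record above.

```python
def parse_genbank_header(text, key_width=12):
    """Return (keyword, value) pairs from GenBank-style header text.

    Assumption: the keyword sits in the first `key_width` columns and
    continuation lines leave those columns blank.
    """
    fields = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        key = line[:key_width].strip()
        value = line[key_width:].strip()
        if key:
            # A new keyword (or indented sub-keyword such as ORGANISM) starts.
            fields.append([key, value])
        elif fields:
            # Continuation line: append to the value of the previous keyword.
            fields[-1][1] = (fields[-1][1] + " " + value).strip()
    return [(k, v) for k, v in fields]


# A few lines taken from the record above, padded to the 12-column layout.
sample = "\n".join([
    "DEFINITION".ljust(12) + "Drosophila melanogaster transcription factor Toy (toy) mRNA,",
    " " * 12 + "complete cds.",
    "SOURCE".ljust(12) + "Drosophila melanogaster (fruit fly)",
    "  ORGANISM".ljust(12) + "Drosophila melanogaster",
])

for keyword, value in parse_genbank_header(sample):
    print(keyword, "->", value)
```

Keywords such as REFERENCE can repeat, which is why the sketch returns a list of pairs rather than a dictionary.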
Sequence alignment The basic sequence analysis task is to ask whether two sequences are related –sequence similarity/homology When we compare sequences, we assume that they have diverged by a process of mutation. The mutational processes are substitutions, which change residues in a sequence, and insertions and deletions, which add or remove residues. An alignment can be extended in three ways: match, mismatch, and gap.
Sequence alignment We use a dynamic programming matrix to find the optimal alignment. To align CATGT with ACGCTG, we first fill the matrix with scores: +2 for a match, -1 for a mismatch, -1 for a gap.
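The matrix fill and traceback can be sketched in code. This is a minimal illustration that assumes a global alignment (the Needleman-Wunsch scheme) with a linear gap penalty; the slides do not say whether a global or a local alignment is intended, and the function name `needleman_wunsch` is chosen here for illustration.

```python
def needleman_wunsch(a, b, match=2, mismatch=-1, gap=-1):
    """Global alignment by dynamic programming (assumed scheme, linear gaps)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # First column and first row: a prefix aligned entirely against gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def pair(i, j):
        """Score for aligning a[i-1] with b[j-1]."""
        return match if a[i - 1] == b[j - 1] else mismatch

    # Each cell is the best of the three ways to extend an alignment.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + pair(i, j),  # match or mismatch
                score[i - 1][j] + gap,             # gap in b
                score[i][j - 1] + gap,             # gap in a
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


best, row_a, row_b = needleman_wunsch("CATGT", "ACGCTG")
print(best)
print(row_a)
print(row_b)
```

Several alignments can share the optimal score; the traceback above returns only one of them, preferring a diagonal step when there is a tie.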
Sequence alignment Figures on this slide: the maximum score, the best alignment, and example output from the BLAST program.
Gene prediction Predicting gene locations –identify all the open reading frames (ORFs) in unannotated DNA –compare a query sequence against an entire annotated DNA database to find similar sequences –use Bayesian statistics to find the most probable residue to follow a known subsequence, for example P(CCGAT) = P(CC) * P(G|CC) * P(A|CG) * P(T|GA)
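The product P(CC) * P(G|CC) * P(A|CG) * P(T|GA) conditions each base on the two bases before it. One way to sketch this is a second-order Markov chain whose probabilities are estimated by counting in known sequences. The training sequences below are made up for illustration, and the function names `train` and `probability` are hypothetical.

```python
from collections import Counter


def train(sequences):
    """Estimate start and conditional probabilities by simple counting."""
    start = Counter()    # how often each 2-base prefix opens a sequence
    context = Counter()  # how often each 2-base context is followed by a base
    triple = Counter()   # how often each context + next base occurs
    for seq in sequences:
        if len(seq) >= 2:
            start[seq[:2]] += 1
        for i in range(2, len(seq)):
            context[seq[i - 2:i]] += 1
            triple[seq[i - 2:i + 1]] += 1
    total = sum(start.values())
    start_p = {pair: n / total for pair, n in start.items()}
    # cond_p["CCG"] is P(G | CC), and so on.
    cond_p = {t: n / context[t[:2]] for t, n in triple.items()}
    return start_p, cond_p


def probability(seq, start_p, cond_p):
    """P(seq) = P(first two bases) * product of P(base | previous two)."""
    p = start_p.get(seq[:2], 0.0)
    for i in range(2, len(seq)):
        p *= cond_p.get(seq[i - 2:i + 1], 0.0)  # unseen triples get 0
    return p


# Made-up training data, for illustration only.
training = ["CCGAT", "CCGTT", "ACGAT", "CCAAT"]
start_p, cond_p = train(training)
print(probability("CCGAT", start_p, cond_p))
```

A real gene finder would train on large sets of annotated coding and non-coding sequence and would smooth the counts so that unseen triples do not force the probability to zero.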
Gene prediction –implement a hidden Markov model (HMM)
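A hidden Markov model for gene prediction labels each base with a hidden state and asks for the most probable state path. The sketch below uses the Viterbi algorithm on a two-state toy model. The states ("N" for non-coding, "C" for coding) and every probability in it are invented for illustration and are not taken from the slides or from any real gene finder.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable hidden-state path for a non-empty observation string."""
    # Log-probabilities avoid numerical underflow on long sequences.
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = []  # back[t][s] = best previous state when in state s at step t + 1
    for symbol in obs[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda r: best[-1][r] + math.log(trans_p[r][s]))
            scores[s] = (best[-1][prev] + math.log(trans_p[prev][s])
                         + math.log(emit_p[s][symbol]))
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)

    # Follow the back-pointers from the best final state.
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


# Toy model (all numbers are assumptions): coding regions are GC-rich here.
states = ["N", "C"]
start_p = {"N": 0.5, "C": 0.5}
trans_p = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
emit_p = {
    "N": {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1},
    "C": {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4},
}

print(viterbi("ATATGCGCGC", states, start_p, trans_p, emit_p))
```

Real gene-finding HMMs use many more states, for example for start and stop codons, exons, introns, and codon positions.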
Phylogenetic analysis –the process of developing hypotheses about the evolutionary relatedness of organisms based on their observable characteristics Phylogenetic tree –built from a multiple sequence alignment
Phylogenetic analysis –implements the parsimony method, UPGMA, cladistics, neighbor joining, the least squares method, maximum likelihood, or clustering to determine the differences between the sequences –finds the relatedness by clustering
Phylogenetic analysis Figures on this slide: the percentage of identity, and the resulting phylogenetic tree.
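The two steps, percentage of identity followed by clustering into a tree, can be sketched with UPGMA, one of the methods listed earlier. The four aligned sequences are made up for illustration, the distance is assumed to be 100 minus the percentage of identity, and the function names are hypothetical.

```python
from itertools import combinations


def percent_identity(a, b):
    """Percentage of aligned columns with identical residues.

    Columns where both sequences have a gap are ignored.
    """
    columns = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    same = sum(1 for x, y in columns if x == y)
    return 100.0 * same / len(columns)


def upgma(names, dist):
    """Cluster with UPGMA; `dist` maps frozenset({a, b}) to a distance.

    Returns the tree as nested tuples, without branch lengths.
    """
    sizes = {name: 1 for name in names}  # cluster label -> number of leaves
    d = dict(dist)
    while len(sizes) > 1:
        # Join the two closest clusters.
        a, b = min(combinations(sizes, 2), key=lambda p: d[frozenset(p)])
        merged = (a, b)
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        # Distance to the new cluster is the size-weighted average.
        for c in sizes:
            d[frozenset((merged, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        sizes[merged] = size_a + size_b
    return next(iter(sizes))


# Made-up multiple sequence alignment, for illustration only.
alignment = {
    "A": "ACGTACGT",
    "B": "ACGTACGA",
    "C": "ACGAACTA",
    "D": "TCGAACTA",
}
distances = {
    frozenset((p, q)): 100.0 - percent_identity(alignment[p], alignment[q])
    for p, q in combinations(alignment, 2)
}
tree = upgma(list(alignment), distances)
print(tree)
```

UPGMA assumes that all lineages evolve at the same rate; neighbor joining and maximum likelihood relax that assumption at the cost of more computation.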
Protein structure prediction Two approaches to computational modeling of protein structure –knowledge-based modeling employs parameters extracted from the database of existing structures to evaluate and optimize structures –predict structure from sequence
Protein structure prediction Predict from sequence: –select the protein sequence of the target –determine the secondary structure by calculating its hydrophobicity values
Protein structure prediction –align the structure with similar sequences in the databank –find the list of angles –draw the structure
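The hydrophobicity step in the list above can be sketched as a sliding-window average. The slides do not name a hydrophobicity scale; this sketch assumes the Kyte-Doolittle hydropathy scale, and the function name `hydropathy_profile`, the window size, and the example peptide are illustrative choices.

```python
# Kyte-Doolittle hydropathy values (assumed scale; the slides name none).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=7):
    """Average hydropathy over each window of consecutive residues."""
    values = [KYTE_DOOLITTLE[residue] for residue in sequence.upper()]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Made-up peptide: a hydrophobic stretch flanked by charged residues.
peptide = "KKDEILLVVAILLGKRDE"
for position, value in enumerate(hydropathy_profile(peptide), start=1):
    print(position, round(value, 2))
```

High average values point to hydrophobic stretches that tend to be buried or membrane-spanning. A profile like this is only one input to secondary-structure assignment, not a complete predictor.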
Other studies Protein structure property analysis Biochemical simulation Whole genome analysis Primer design DNA microarray analysis Proteomics analysis
Conclusion Bioinformatics can provide anything from the abstraction of the properties of a biological system into a mathematical or physical model, to the implementation of new algorithms for data analysis.