Compression of Protein Sequences EE-591 Information Theory FEI NAN, SUMIT SHARMA May 3, 2003.



GENERAL OVERVIEW OF THE PROJECT
- DNA and proteins form the basic structure of life, and their sequences can be stored as ordinary text files.
- Standard compression techniques do not give good results on these sequences.
- In this project we use a new technique called CP (Compression Scheme) to compress protein sequences and analyze how well it performs.

THE NEED FOR COMPRESSION
There are two different motivations for compression:
- Compression enables efficient use of resources such as storage and bandwidth.
- From a scientific perspective, it provides a way of capturing and quantifying structure in the sequence.
We stress the second motivation here, because we are dealing with biological sequences.

- A good model for compression assigns high probability to a few symbols (preferably one dominant symbol), which allows very compact coding of those probable symbols.
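The effect of a few high-probability symbols can be illustrated with Shannon entropy, the lower bound on the average number of bits per symbol. The sketch below is illustrative only: the two distributions are made-up examples over a 20-symbol alphabet, not figures from the project.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits per symbol for a probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical distributions over 20 symbols (assumed for illustration).
uniform = [1 / 20] * 20          # every symbol equally likely
skewed = [0.81] + [0.01] * 19    # one dominant symbol

print(entropy_bits(uniform))  # equals log2(20), about 4.32 bits per symbol
print(entropy_bits(skewed))   # noticeably lower than the uniform case
```

The skewed distribution needs far fewer bits per symbol on average, which is why a model that concentrates probability on the right symbols compresses well.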

- DNA is the genetic material of life; it transmits information from one generation to the next.
- DNA can be represented as a sequence over a four-symbol alphabet of nucleotides: A (adenine), C (cytosine), G (guanine), T (thymine).
- A bonds only with T, and G bonds only with C.
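As a small illustration of the pairing rule and the four-symbol alphabet, the sketch below computes the complementary strand and the fixed-length cost of a nucleotide. The function name `complement` and the sample strand are assumptions made for this example.

```python
import math

# Base pairing: A bonds with T, G bonds with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complement(strand):
    """Return the complementary strand, base by base."""
    return "".join(COMPLEMENT[base] for base in strand)

# A four-symbol alphabet needs log2(4) = 2 bits per symbol with a fixed-length code.
BITS_PER_BASE = math.log2(len(COMPLEMENT))

print(complement("ATGC"))  # TACG
print(BITS_PER_BASE)       # 2.0
```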

DNA STRUCTURE

Protein Structure
- A protein sequence is very long and is built from only 20 amino acids, so when it is stored as an ordinary text file it contains a high level of redundancy.
- When we talk about compression, however, we have to remember that these are biological sequences: proteins are subject to mutations that destroy repetition.

3-D Protein Structure

PROTEIN STRUCTURE

- The redundancy in proteins comes mainly from two sources:
  - New genes arise through duplication.
  - Mistakes are made while copying DNA and in other cellular processes.

- In compressing proteins we take a distance metric into account. The distance reflects mutation probabilities: symbols that are close together are likely to be derived from the same symbol by mutation, and symbols that are far apart are not.
- In our scheme the distance is used to combine the predictions made by different contexts: we sum over all possible contexts up to a certain length, each weighted by its similarity to the current context.
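The idea of summing over contexts weighted by similarity can be sketched as follows. This is not the project's code or the exact CP formula: the function `blended_prediction`, the toy `toy_similarity` function, the small probability floor, and the use of a single context length (rather than all lengths up to a maximum) are all assumptions made to keep the example short.

```python
def blended_prediction(history, order, alphabet, similarity):
    """Estimate next-symbol probabilities from every earlier context of
    length `order`, weighting each by its similarity to the current context."""
    current = history[-order:]
    # Small floor so that unseen symbols keep a nonzero probability (assumed).
    scores = {s: 1e-3 for s in alphabet}
    for i in range(len(history) - order):
        context = history[i:i + order]
        following = history[i + order]
        weight = 1.0
        for a, b in zip(context, current):
            weight *= similarity(a, b)  # similar contexts contribute more
        scores[following] += weight
    total = sum(scores.values())
    return {s: v / total for s, v in scores.items()}

def toy_similarity(a, b):
    """Hypothetical similarity: 1 for identical symbols, 0.1 otherwise."""
    return 1.0 if a == b else 0.1

probs = blended_prediction("ACBCAC", 1, "ABC", toy_similarity)
print(probs)
```

In a real setting the similarity would be derived from a mutation matrix over the 20 amino acids, so that a context differing only by a likely mutation still contributes strongly to the prediction.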

- Our project makes use of the following concepts:
  - PPM
  - Adaptive coding

PREDICTION BY PARTIAL MATCHING (PPM)
- The basic idea of PPM is to use the last few characters to predict the probabilities of the next one.
- Example: suppose the sequence seen so far ends with aba (for instance abcd aba aba). PPM looks up the previous occurrences of aba and tallies the symbols that followed them in order to predict the symbol that will occur next.
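The tallying step in the example above can be sketched in a few lines. This shows only the context-counting idea; a full PPM coder also falls back to shorter contexts through escape symbols, which is omitted here. The function name and the sample string are assumptions for illustration.

```python
from collections import Counter

def next_symbol_counts(history, order):
    """Tally the symbols that followed earlier occurrences of the
    current context (the last `order` characters of `history`)."""
    context = history[-order:]
    counts = Counter()
    for i in range(len(history) - order):
        if history[i:i + order] == context:
            counts[history[i + order]] += 1
    return counts

# The sequence ends with "aba"; earlier occurrences were followed by "c" and "d".
print(next_symbol_counts("abacabadaba", 3))
```

Dividing each count by the total gives the predicted probability of each candidate next symbol.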

CP COMPRESSION SCHEME
- This scheme takes into account that biological sequences constantly undergo mutation, and that a mutation is accepted as long as the new sequence has similar properties. Exact repetition is therefore overlaid with mutation, which is modeled by the distance function, so it is desirable to account for this in the compression scheme.
- In general this is given by the formula [formula not preserved in the transcript]

f  We see in the equation that weight frequent context more highly this has some merits as it has less variance in distribution thus to improve this more we take the weight context equally,we do that by converting f

Explanation of the program
- Calculation of the first order
- Calculation of the second order
- Calculation of the third order

Major Functions
- There are five major functions in the code:
  1. Function to read the mutation matrix
  2. Function to read the target sequence
  3. Function to compute the n-order value
  4. Function to compute the dynamic probability based on the CP algorithm (a major part, distinct from the existing algorithm in the 1987 CACM article by Witten, Neal, and Cleary)
  5. Main function

Modules in the Code
- One function calculates the dynamic probability of each symbol: when we read a new symbol, we increase the occurrence count of that symbol by 1, increase the total number of symbols read by 1, and use this function to recompute the probability dynamically.
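The counting scheme described here might look like the sketch below. The class name `AdaptiveModel` and the choice to start every count at 1 (so that no symbol has zero probability) are assumptions, not details taken from the project's code.

```python
class AdaptiveModel:
    """Symbol probabilities that are updated as each symbol is read."""

    def __init__(self, alphabet):
        # Assumption: start each count at 1 so unseen symbols are still codable.
        self.counts = {symbol: 1 for symbol in alphabet}
        self.total = len(self.counts)

    def probability(self, symbol):
        """Current estimate: occurrences of the symbol over symbols seen."""
        return self.counts[symbol] / self.total

    def update(self, symbol):
        """Increase the symbol's count and the total count by 1."""
        self.counts[symbol] += 1
        self.total += 1

model = AdaptiveModel("ACDE")
print(model.probability("A"))  # 0.25 before any symbol is read
model.update("A")
print(model.probability("A"))  # 0.4 after one "A" (2 of 5)
```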

ADAPTIVE CODING
- Adaptive coding does not require the probabilities to be transmitted with the encoded data.
- It requires only one pass through the data.
- It does not use fixed symbol probabilities.
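These three properties can be seen in a short sketch that estimates the ideal code length of a sequence in a single pass. Each symbol is charged -log2(p) bits under the current model, and only then is the model updated, so a decoder can rebuild the same probabilities without receiving them. The function name and the initial counts of 1 are assumptions; an actual coder would feed these probabilities to an arithmetic coder.

```python
import math

def adaptive_code_length(sequence, alphabet):
    """Ideal code length in bits for one adaptive pass over `sequence`."""
    counts = {symbol: 1 for symbol in alphabet}  # assumed initial counts
    total = len(counts)
    bits = 0.0
    for symbol in sequence:
        bits += -math.log2(counts[symbol] / total)  # cost under current model
        counts[symbol] += 1                         # then adapt the model
        total += 1
    return bits

# Eight repeats of one symbol cost log2(9) bits, well under 1 bit per symbol.
print(adaptive_code_length("AAAAAAAA", "AC"))
```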

Time of Execution of Each Text

nn  As we can see from the diagram in the last slide, the time for the successful execution of n-order CP is increased with the n by a factor of 20*. n  So we cannot make n too large although we might receive a good compression ratio. *Craig G, Nevill-Manning, Ian H. Witten Protein is incompressible

Improvements
- Improvement of the execution time.
- Improvement of the floating-point precision.
Solution
- Adding high-performance hardware
- Optimizing the software algorithm

Thanks ?