Presentation on theme: "Computer Science and Bioinformatics James Edwards and Rajinder Singh Bhatti."— Presentation transcript:
Computer Science and Bioinformatics James Edwards and Rajinder Singh Bhatti.
Biology and Computer Science? Initially Biology depended on Chemistry to make major strides Biochemistry Biology then needed to work at atomic level explaining phenomena Biophysics The modern era of Biology needs to interpret a wealth of data, tools that only computer Science is able to provide Hence Bioinformatics
What is Bioinformatics? The study of computational methods to expand the use of biological data (Data Orientated). Often (incorrectly) used instead of the term ‘Computational Biology’. However this is a slightly different discipline. Computational Biology is the use of computational and mathematical methods to study or simulate biological systems (Hypothesis Orientated). [source National Institutes of health]
Overlaps Between the two Disciplines 1 – Bioinformatics problems 2 – Computational Biology problems 3 – Problems in both categories 4 – Problems in neither category
Motivation for Bioinformatics Quote from Donald Knuth Turing Award winner: “…I can’t be as confident about computer science as I can about biology. Biology easily has 500 years of exciting problems to work on. It’s at that level.” [source – Wikiquotes] Can Biological life be equated with Computing? Results so far would suggest the answer is yes!
Common Bioinformatics Problems Finding and assessing Similarities between Strings (next slides). Detecting patterns in strings. Constructing trees of the evolution of organisms. Classifying new data by clustering existing data. Also applications of Machine Vision to detect interactions between proteins
Data Structures Used in Biology Strings for representing sequences (e.g. DNA, RNA, Amino Acid Sequences). “ATACGGCGCGCAAGGCT” “TATGCCGCGCGTTCCGA” Trees for representing the evolution of organisms and other purposes. Prokaryotes Eukaryotes ReptilesBirds ……
Data Structures (Cont..) Graphs can represent signalling pathways (often found in Neural networks) 3d Points and their Linkages can represent protein structures.
First Instance of a Problem – DNA Shotgun Sequencing In order to derive a DNA sequence, the DNA must first be duplicated many times. ATCACCGTAAGAGGA It must then be processed by Gel Electrophoresis, which ‘chops’ the DNA into smaller pieces named ‘fragments’. ATCACCGTAAGAGGA ATC CCGT AGGA AAG CCGT AAGA TCA This is a very simplified Instance of the problem typically each fragment can be between 250 and 1000 Bases long. TAA GTA
Alignments – the Smith Waterman Method. How do we identify fragments which link together? Can use dynamic programming to compute optimal alignment scores between fragments. Align with either match (1) gap -(1/3 x length of gap) or mismatch (-1). The score in each cell is the best total score from an already chosen cell/row + the cost of the alignment. If a score is < 0 it is said to be 0. The first row is always filled with 0’s 000A 010T 000A CTA
The Smith Waterman algorithm (Continued) Following this trace back a path through the optimum alignment starting at the highest number in the matrix to the first 0. In this case it is: ‘AT’ Algorithm extremely expensive O(NM) run time and O(NM) storage complexity. Always finds optimum solution. 000A 010T 000A CTA
Alternative to SW Algorithm Sequences are usually at the very least tens of thousands of characters long Makes O(NM) runtime (and storage complexity) unacceptable. Alternative – use BLAST (Basic Local Alignment Search Tool) Algorithm. Gives a much more reasonable run time of O(N+M). However does not always compute best solution.
BLAST Algorithm Computing an entire matrix of values will always require N x M space. Iterating over values will always require N x M Space. Solution: Ignore parts of the alignment which are unlikely to improve the score. This improves the Storage Complexity as only a singular alignment must be stored. It also improves the Runtime Complexity as at each stage of the algorithm only the optimum so far is processed.
BLAST Illustrated The strings at the beginning and end are very unlikely to improve the score of the alignment. Therefore no gap and mismatches are computed in the matrix CTCTCTCTCATTGATTGCGGGGGG GGGGGGGGGATTGATTGCCCCCCC ATTGATTGC Consider forming an alignment between two sequences:
Alignments Relation to Shotgun Sequencing. So now there is a way to measure which fragments are likely to align we still need a way to find the correct order efficiently. In depth Algorithm beyond scope of presentation However the best current techniques are: Greedy Methods (align every element – then use only best solutions). Evolutionary Algorithms (start with initial set of solutions, computing sum of alignment scores then ‘evolve’ set of solutions in each iteration). Problem is NP- Hard – Techniques give Approximations.
Relating Computer Science to Biology What have us Computer Science students studied so far in this MSc course that can have some use to Bioinformatics? Data Mining Artificial Intelligence Heuristic approaches (e.g. Knowledge Representation – Logics) Algorithm Techniques
Data Mining and Bioinformatics How and why? Some of you do COMP 527 Data Mining with Rob Why Data Mining is essential in Bioinformatics. KDD (Knowledge Discovery DB) is the process of finding useful information and patterns in data. Data Mining is the use of algorithms to extract information and patterns derived by the KDD process. Graphical Techniques such as Brush, Data smoothing etc.
Data Mining and Bioinformatics Algorithm implementation examples Data Mining algorithm use for tackling problems in Bioinformatics In conjunction with microarray Technology Predict a patients outcome, such as survival time disease recurrence health risk assessments etc... How does Data Mining help? Accurate predictions could help provide better treatment!
AI and Bioinformatics Artificial Intelligence? Research in genetics, molecular biology etc. generate enormous amounts of data Use AI to extract useful information from the wealth of available data Build good probabilistic models (gene models) AI provides several powerful algorithms and techniques solving these problems using the stored data
AI and Bioinformatics AI techniques used Neural networks (Biological and Artificial) Hidden Markov models (Probabilistic Statistical models) Bayesian networks (Models logic) and many others....
Logic and Bioinformatics Biology works by applying prior knowledge “what is known” to unknown entities. Therefore Biology said to be knowledge- based (rather than axiom based) Use pre-existing knowledge to make inferences about the item under investigation. Description Logic?
Description Logic and Bioinformatics Why description Logic? decidable logic with good systems impossible for a single biologist to deal with all of a domains knowledge! similar to programmers writing extremely complex programs without an IDE to help with libraries medical diagnosis systems make good use of ABOX and TBOX assertions for example, determine if a patients problem is an element of a particular known disease
Description Logic Example TBOX sick person isInfected.Cancer non_sick person isInfected.Cold ABOX Tim : personSteven : person Cancer : ProblemCold : Problem (Tim, Cancer) : isInfected (Steven, Cold) : isInfected
Improvements How far has Bioinformatics come? “One is struck both by how far the field has come in a relatively short period of time, and also by how far it has yet to go.” - Jessica D. Tenenbaum The discipline of Bioinformatics has vastly improved over recent years due to Fast technological development of the computer industry Demand for Computer Scientists - more computer scientists than ever before! Biological “unknown” discoveries – things that are discovered with no previous knowledge base Growing of sub-Biology interests, such as molecular Biology
Improvements How far will Bioinformatics go? Thoroughly depends if the gap between Biology and Computer Science increases or decreases The gap increases if educational institutions decide ignore Bioinformatics Put emphasis on prospective students Computer Scientists choose to ignore Biology Biologists choose to ignore Computer Science
Closing the gap I Biologists cannot build their own analytical tools Computer Scientists don't know what to build!
Closing the gap II Putting a Computer Scientist (Data Mining expert) into a room with a Biologist investigator wont solve the problem Boundaries such as methodologies and discipline language are a problem.
Closing the gap III Computer Science is the “science of the artificial” Biology is the “science of discovery” The only way to bridge the gap is for both parties to learn the basic fundamentals of each science
Breakthroughs of Bioinformatics Spatial patterns of structures for understanding protein folding, evolution, and biological functions To predict protein functions, we develop a method by rapidly matching local surfaces and by incorporating evolutionary information specific to individual binding region via a Bayesian Monte Carlo approach. These kinds of breakthroughs encourage the computer industry to get involved and work with Biology.
Related Problems? Are there any other disciplines which involve the similar integration of Computer Science with Biology? Cheminformatics/Chemoinformatics the application of informatics tools to solve discovery chemistry problems an integral component of hit and lead generation development of new computational methods or efficient algorithms for chemical software, and pharmaceutical chemistry including analyses of biological activity and other issues related to drug discovery
Related Problems? Are there any other disciplines which involve the similar integration of Computer Science with Biology? Other similar interests Ecoinformatics Geoinformatics Quantum informatics Astroinformatics Business informatics And many others...
Follow ups of Jacques Cohen Bioinformatics—an introduction for computer scientists is a previous publication from Jacques Cohen aims to encourage Computer Scientists to get involved with Biology Updating Computer Science Education released after Bioinformatics and Computer Science Talks about encouraging the next generation of Computer Scientists that Computer Science is more than just programming.
Who is Jacques Cohen? Currently serving Brandies University since Docter in the field of analysis of algorithms, parsing and compiling, memory management, logic and constraint logic programming, and parallelism Recently started researching his interest of Bioinformatics His most recent publication is about methods used in microarray Data Interpretation See
References and related material (All web links last accessed 4 th February 2008) Shotgun sequencing G. Luque, E. Alba Torres and S. Khuri, Assembling DNA Fragments with a Distributed Genetic Algorithm, Parallel Computing for Bioinformatics and Computational Biology, Wiley- Interscience, New Jersey, 2006, Chapter 12, pp L.D. Paulson, Bioinformatics Experiences Important Breakthroughs, 2005, pp J. Cohen, Bioinformatics: An Introduction for Computer Scientists, ACM Computing Surveys, 36(2), , B. Tjaden, J. Cohen, A Survey of Computational Methods used in Microarray Data Interpretation, Applied Mycology and Biotechnology, Bioinformatics 6, J. Cohen, Updating Computer Science Education, Communications of the ACM, 48(6), 29-31, J. Cohen, Computational Molecular Biology: A Promising Application Using Logic Programming and Constraint Logic Programming, Lecture Notes in Artificial Intelligence, R. Stevens, C.A. Goble and S. Bechhofer, Ontology-based Knowledge Representation for Bioinformatics, Jinyan Li, Limsoon Wong and Qiang Yang, Data Mining in Bioinformatics, Various material about Bioinformatics, Data Mining in Bioinformatics, muenchen.de/Forschung/Bioinformatics/http://www.dbs.informatik.uni- muenchen.de/Forschung/Bioinformatics/