Presentation is loading. Please wait.

Presentation is loading. Please wait.

Master’s course Bioinformatics Data Analysis and Tools Lecture 1: Introduction Centre for Integrative Bioinformatics FEW/FALW C E N T.

Similar presentations


Presentation on theme: "Master’s course Bioinformatics Data Analysis and Tools Lecture 1: Introduction Centre for Integrative Bioinformatics FEW/FALW C E N T."— Presentation transcript:

1 Master’s course Bioinformatics Data Analysis and Tools Lecture 1: Introduction Centre for Integrative Bioinformatics FEW/FALW heringa@few.vu.nl C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E

2 Course objectives There are two extremes in bioinformatics work –Tool users (biologists): know how to press the buttons and know the biology but have no clue what happens inside the program –Tool shapers (informaticians): know the algorithms and how the tool works but have no clue about the biology Both extremes can be dangerous at times, need a breed that can do both

3 Course objectives How do you become a good bioinformatics problem solver? –You need to know basic analysis and data mining modes –You need to know some important backgrounds of analysis and prediction techniques (e.g. statistical thermodynamics) –You need to have knowledge of what has been done and what can be done (and what not) Is this enough to become a creative tool developer? –Need to like doing it –Experience helps

4 Course objectives The most important thing in tools creating (and science in general) Be able to ask the proper question! –Should address a real problem –Should be targeted –Should be solvable Bioinformatics challenge: from Genomics to Systems Biology –Bottom up: start at the components, assemble and learn the system –Top down: observe the system behaviour, model and learn the details –What about bottom down or top up questions?

5 Contents (tentative dates) DateLecture TitleLecturer 1 [wk 19] 01/04/08IntroductionJaap Heringa 2 [wk 19] 03/04/08Microarray data analysisJaap Heringa 3 [wk 20] 08/04/08Machine learningJaap heringa 4 [wk 21] 10/04/08Clustering algorithmsBart van Houte 5 [wk 21] 15/04/08Feature Selection Bart van Houte 6[wk 23] 17/04/08Molecular Simulation & Sampling TechniquesAnton Feenstra 7[wk 23] 22/04/08Introduction to Statistical Thermodynamics IAnton Feenstra 8[wk 24] 24/04/08Introduction to Statistical Thermodynamics II Anton Feenstra 9[wk 24] 06/05/08Databases and parsingSandra Smit 10[wk 24] 08/05/08Semantic Web and OntologiesFrank van Harmelen 11[wk 25] 13/05/08 Parallelisation& Grid Computing Thilo Kielmann 12[wk 25] 15/05/08Application area I: Protein Domain Prediction Jaap Heringa 13[wk 25] 20/05/08Application Area II: Repeats Detection Jaap Heringa

6 At the end of this course… You will have seen a couple of algorithmic examples You will have got an idea about methods used in the field You will have a firm basis of the physics and thermodynamics behind a lot of processes and methods You will have learned about state-of-the-art computational issues, such as Semantic Web and HTP Computing You will have an idea of and some experience as to what it takes to shape a bioinformatics tool

7 Bioinformatics “Studying informatic processes in biological systems” (Hogeweg) Applying algorithms and mathematical formalisms to biology (genomics) “Information technology applied to the management and analysis of biological data” (Attwood and Parry-Smith)

8 This course General theory of crucial algorithms (GA, NN, HMM, SVM, etc..) Method examples Research projects within own group –Repeats –Domain boundary prediction Physical basis of biological processes and of (stochastic) tools

9 Bioinformatics Large - external (integrative)ScienceHuman Planetary ScienceCultural Anthropology Population Biology Sociology SociobiologyPsychology Systems Biology Biology Medicine Molecular Biology Chemistry Physics Small – internal (individual) Bioinformatics

10 Genomic Data Sources DNA/protein sequence Expression (microarray) Proteome (xray, NMR, mass spectrometry) PPI Metabolome Physiome (spatial, temporal) Integrative bioinformatics

11 Protein structural data explosion Protein Data Bank (PDB): 14500 Structures (6 March 2001) 10900 x-ray crystallography, 1810 NMR, 278 theoretical models, others...

12 Mathematics Statistics Computer Science Informatics Biology Molecular biology Medicine Chemistry Physics Bioinformatics Bioinformatics inspiration and cross-fertilisation

13 Joint international programming initiatives Bioperl http://www.bioperl.org/wiki/Main_Page http://bioperl.org/wiki/How_Perl_saved_human_genome Biopython http://www.biopython.org/ BioTcl http://wiki.tcl.tk/12367 BioJava www.biojava.org/wiki/Main_Page

14 Algorithms in bioinformatics string algorithms dynamic programming machine learning (NN, k-NN, SVM, GA,..) Markov chain models hidden Markov models Markov Chain Monte Carlo (MCMC) algorithms stochastic context free grammars EM algorithms Gibbs sampling clustering tree algorithms (suffix trees) graph algorithms text analysis hybrid/combinatorial techniques and more… Some techniques can be reapplied to many different problems, e.g. clustering, NN, etc.

15 Algorithms in bioinformatics Approaches in methods can be reapplied, e.g. kowledge-based (inverse) prediction Different approaches can be combined, e.g. consensus prediction, metaservers

16 DBT hits PSSM Q Discarded sequences Run query sequence against database Run PSSM against database PSI-BLAST iteration

17 Fold recognition by threading: THREADER and GenTHREADER Query sequence Compatibility scores Fold 1 Fold 2 Fold 3 Fold N

18 Polutant recognition by microarray mapping: Compatibility scores Cond. 1 Cond. 2 Cond. 3 Cond. N Contaminant 1 Contaminant 2 Contaminant 3 Contaminant N Query array

19 Protein-protein interaction prediction Two extreme approaches, phlogenetic prediction and Molecular Dynamics (MD) simulation Mesoscopic modelling Soft-core Molecular Dynamics (MD) –Fuzzy residues –Fuzzy (surface) locations

20 ENFIN WP5 - BioRange (Anton Feenstra) Protein-protein interaction prediction Mesoscopic modelling Soft-core Molecular Dynamics (MD) –Fuzzy residues –Fuzzy (surface) locations

21 Where are important new questions?

22 New neighbouring disciplines Translational Medicine A branch of medical research that attempts to more directly connect basic research to patient care. Translational medicine is growing in importance in the healthcare industry, and is a term whose precise definition is in flux. In particular, in drug discovery and development, translational medicine typically refers to the "translation" of basic research into real therapies for real patients. The emphasis is on the linkage between the laboratory and the patient's bedside, without a real disconnect. This is often called the "bench to bedside" definition. Computational Systems Biology Computational systems biology aims to develop and use efficient algorithms, data structures and communication tools to orchestrate the integration of large quantities of biological data with the goal of modeling dynamic characteristics of a biological system. Modeled quantities may include steady-state metabolic flux or the time-dependent response of signaling networks. Algorithmic methods used include related topics such as optimization, network analysis, graph theory, linear programming, grid computing, flux balance analysis, sensitivity analysis, dynamic modeling, and others. Neuro-informatics Neuroinformatics combines neuroscience and informatics research to develop and apply the advanced tools and approaches that are essential for major advances in understanding the structure and function of the brain

23 Translational Medicine “From bench to bed side” Genomics data to patient data Integration

24 Natural progression of a gene

25 TERTIARY STRUCTURE (fold) Genome Expressome Proteome Metabolome Functional Genomics From gene to function

26

27

28 Systems Biology is the study of the interactions between the components of a biological system, and how these interactions give rise to the function and behaviour of that system (for example, the enzymes and metabolites in a metabolic pathway). The aim is to quantitatively understand the system and to be able to predict the system’s time processes the interactions are nonlinear the interactions give rise to emergent properties, i.e. properties that cannot be explained by the components in the system

29 Systems Biology understanding is often achieved through modeling and simulation of the system’s components and interactions. Many times, the ‘four Ms’ cycle is adopted: Measuring Mining Modeling Manipulating

30 Neuroinformatics Understanding the human nervous system is one of the greatest challenges of 21st century science. Its abilities dwarf any man-made system - perception, decision-making, cognition and reasoning. Neuroinformatics spans many scientific disciplines - from molecular biology to anthropology.

31 Neuroinformatics Main research question: How does the brain and nervous system work? Main research activity: gathering neuroscience data, knowledge and developing computational models and analytical tools for the integration and analysis of experimental data, leading to improvements in existing theories about the nervous system and brain. Results for the clinic: Neuroinformatics provides tools, databases, models, networks technologies and models for clinical and research purposes in the neuroscience community and related fields.

32

33 Bioinformatics algorithms For problems such as alignenment, secondary/tertiary structure prediction, phylogenetic tree determination, etc. Algorithmic main components: Search function Scoring function

34 Bioinformatics algorithms Search function –Search space can be large –High time complexity of algorithms, or even NP-complete/NP-hard problems Scoring function –Also called the Objective Function –Often the most important

35 Pair-wise alignment Complexity of the problem Combinatorial explosion - 1 gap in 1 sequence: n+1 possibilities - 2 gaps in 1 sequence: (n+1)n - 3 gaps in 1 sequence: (n+1)n(n-1), etc. 2n (2n)! 2 2n = ~ n (n!) 2   n 2 sequences of 300 a.a.: ~10 88 alignments 2 sequences of 1000 a.a.: ~10 600 alignments! T D W V T A L K T D W L - - I K

36 Levinthal’s paradox (1969) Denatured protein refolds in ~ 0.1 – 1000 seconds Protein with e.g. 100 amino acids each with 2 torsions (  en  ) Each can assume 3 conformations (1 trans, 2 gauche) 3 100x2  10 95 possible conformations! Or: 100 amino acids with 3 possibilities in Ramachandran plot (  coil): 3 100 » 10 47 conformations If the protein can visit one conformation in one ps (10 -12 s) exhaustive search costs 10 47 x 10 -12 s = 10 35 s  10 27 years! (the lifetime of the universe  10 10 years…)

37 Phylogenetic tree combinatorial explosion Number of unrooted trees = Number of rooted trees =

38 Combinatoric explosion # sequences# unrooted# rooted trees trees 211 313 4315 515105 6105945 794510,395 810,395135,135 9135,1352,027,025 102,027,02534,459,425

39 A recap on Scoring and Benchmarking QUERY DATABASE True Positive True Negative True Positive False Positive True Negative False Negative T POSITIVES NEGATIVES

40

41

42

43

44

45

46

47

48

49

50

51

52

53 Evaluating multiple alignments (MSAs) Conflicting standards of truth –evolution –structure –function With orphan sequences no additional information Benchmarks depending on reference alignments Quality issue of available reference alignment databases Different ways to quantify agreement with reference alignment (sum-of-pairs, column score) “Charlie Chaplin” problem Charlie Chaplin once joined a Charlie-Chaplin competition in disguise and became third. What does this tell you about the jury’s ‘objective function’ ?

54 Evaluation measures QueryReference Column score ‘strict’ measure Sum-of-Pairs score more lenient measure What fraction of the matched amino acid pairs (or alignment columns) in the reference MSA are recreated in the query MSA?

55 Scoring a single MSA with the Sum-of-pairs (SP) score Sum-of-Pairs score Calculate the sum of all pairwise alignment scores This is equivalent to taking the sum of all matched a.a. pairs This can be done using gap penalties or not Good alignments should have a high SP score, but it is not always the case that the true biological alignment has the highest score.

56 BAliBASE benchmark alignments Thompson et al. (1999) NAR 27, 2682. 8 categories: cat. 1 - equidistant cat. 1 - equidistant cat. 2 - orphan sequence cat. 2 - orphan sequence cat. 3 - 2 distant groups cat. 3 - 2 distant groups cat. 4 – long overhangs cat. 4 – long overhangs cat. 5 - long insertions/deletions cat. 5 - long insertions/deletions cat. 6 – repeats cat. 6 – repeats cat. 7 – transmembrane proteins cat. 7 – transmembrane proteins cat. 8 – circular permutations cat. 8 – circular permutations

57 Evaluating multiple alignments You can score a single MSA using the sum of all matched amino acid pairs score. This is also referred to as the Sum-of-Pairs (SP) score.. (a bit confusing with the SP score for comparing a query alignment with a reference alignment )

58 Evaluating multiple alignments  SP BAliBASE alignment nseq * len Many test alignments have a higher SP score than the reference alignment (“Charlie Chaplin problem’)

59 Evaluating multiple alignments Many test alignments have a higher SP score than the reference alignment (“Charlie Chaplin problem’)

60 Comparing T-coffee with other methods Column scores are used here

61 BAliBASE benchmark alignments If you are a better program on average, this does not mean you win in all cases… How do you know what is the situation in your case? Even with a better program you can be unlucky..

62 Example from course DPSFAP: inverse folding Top score structure 20 a.a. fragments in the high specificity regions -- Sequence: 3icb (residues 31–50)‏ Protein Starting position Score C  r.m.s.d. Secondary structure (DSSP)‏ to native (A° ) 3icb 31 –7.36 0.00 HHHHH TTTSSSSS HHHHH 1bbk B 32 –6.18 5.65 GGT SSS TT EE S E 1ezm 254 –5.93 4.61 HHHHT TT HHHHHHHHH 8cat A 73 –5.84 8.68 SEEEEEEEEEE S TTT 3enl 196 –5.84 3.82 HHHHHH GGGG B TTS B 1tie 59 –5.75 6.17 EESS SS TT EEEEES 3gap A 97 –5.73 3.11 EEHHHHHHHTTT TTTHHHH 1tfd 71 –5.59 6.50 EEEEEEE S SSS S E 1gsr A 159 –5.54 2.93 HHHHH TTTTTT HHHHHHH 1apb 149 –5.53 4.14 HHHHHHHHHHHHTT GGGE Random 5.88 A°  The native structure is on top

63 Top-scoring structural 20 a.a. fragments in regions where the native state does not have lowest scores but the C  RMSDs are low -- Sequence: 3icb (residues 36–55) Protein Starting position Score C  r.m.s.d.Secondary structure (DSSP)‏ to native (A° ) 1mba 75 –9.54 3.16 HHHHTT HHHHHHHHHHHHH 1mbc 72 –8.59 3.84 HHHHTTT TTTHHHHHHHHH 3gap A 102 –8.43 3.54 HHHHTTT TTTHHHHHHHHH 1ezm 186 –7.83 5.44 ETTTTBSSS SEESSSGGG 1hmd A 67 –7.47 4.76 TTHHHHHHHHHHHHHHHHT 1sdh A 37 –7.42 4.65 HHHHHHH GGGGGGGGGG 2ccy A 36 –7.34 4.38 TTHHHHHHHHHHHHHHGGG 1ama 298 –7.11 2.67 HHHHHHSHHHHHHHHHHHHH 3icb 36 –7.08 0.00 TTTSSSSS HHHHHHHH S 1pbx A 30 –7.06 4.79 HHHHHHH GGGGGGSTTSS Random RMSD: 5.79 A°  The native structure is not on top

64 Inaccurate scoring functions Since most scoring functions do not rank solutions properly, often ensembles of predictions are taken –For example, take the top 10 (or top 5%) predicted structures, is the true structure among those? –In alignment: next to optimal (highest scoring) alignment, the ensemble of supoptimal alignments is taken Biology is also capricious, e.g. the native protein structure does not always have the lowest internal energy

65 Integrate data sources Integrate methods Integrate data through method integration (biological model) Integrative bioinformatics

66 Data Algorithm Biological Interpretation (model) tool Integrative bioinformatics Data integration

67 Data 1Data 2Data 3

68 Integrative bioinformatics Data integration Data 1 Algorithm 1 Biological Interpretation (model) 1 tool Algorithm 2 Biological Interpretation (model) 2 Algorithm 3 Biological Interpretation (model) 3 Data 2Data 3

69 “Nothing in Biology makes sense except in the light of evolution” (Theodosius Dobzhansky (1900-1975)) “Nothing in Bioinformatics makes sense except in the light of Biology” Bioinformatics


Download ppt "Master’s course Bioinformatics Data Analysis and Tools Lecture 1: Introduction Centre for Integrative Bioinformatics FEW/FALW C E N T."

Similar presentations


Ads by Google