Presentation on theme: "Logic-statistic modeling and analysis of biological sequence data: A research agenda Henning Christiansen Roskilde University, Denmark ruc.dk,"— Presentation transcript:
Logic-statistic modeling and analysis of biological sequence data: A research agenda Henning Christiansen Roskilde University, Denmark ruc.dk, www. ruc. dk / ~henning International Workshop on Abduction and Induction in AI and Bioinformatics Aix-en-Provence, 15 sep 2007
Motivation and overall goal Computational analysis of biological sequence data traditionally based on –HMM, SCFG, ad hoc techniques –Each system has its particular type of models A bottle-neck which cannot be remedied (only) by faster and parallel computers We want to promote the application of more expressive and flexible models: logic-statistic methods a la PRISM (Sato, Kameya) I.e. stepping from regular and context-free languages to Turing-complete language
To reach this goal, we will... Approach inherent computational problems –optimizations by program analysis & transformations –interface to existing and efficient software Develop biologically relevant test cases –for biologist to learn how to use such models –to have relevant test cases for the first part
Project setup Funded by the NABIIT program under the Danish Strategic Research Council, 2007–2011 Main academic partners: –Roskilde U. Computer Science: H. Christiansen, J. Gallagher; 1 PhD student (still open!), postdocs –Roskilde U., Biology: O. Skovgaard; 1 PhD student –Aalborg U. Computer Science: M. Jaeger Academic associates: –Taisuke Sato, Tokyo Inst. of Techn. –A. Krogh, Copenhagen Univ. Industrial partners: –Chr. Hansen, Denmark. Wordwide supplier of probiotic products for the dietary supplement industry –CLC bio Leading supplier of bioinformatics software
PRISM (Sato, Kameya) for sequence analysis, introduced by a toy example PRISM extends Prolog with discrete random variables Includes machine learning and prediction methods: –learn best probabilities to explain training data –with learned prob’s, determine best answer to a query Example: Loop structures - a non-context-free phenomenon gggctgg Assume a collection of sequences where loop structures have been identified in the lab Task: Build and train model so it can be used for prediction
Example model in PRISM sequence(...):- noise(...), contact(K,....), noise(....), contactCopy(K,...), noise(...). sequence(K,S):- noise(-,S,S1), contact(K,-,-, S1,S2), noise(-,S2,S3), contactCopy(K, S3,S4), noise(-,S4,). Assume (arbitrarily): ‘noise’ ≈ a 1. order Markov model ‘contact zone’ ≈ a 2. order values(moreNoise,[stop, continue]). values(moreContact,[stop,continue]). values(which(_),[a,c,g,t]). values(which(_,_),[a,c,g,t]). noise(F,S1,S2):- msw(moreNoise,YN), noise2(F,S1,S2,YN). noise2(_,S,S,stop). noise2(F,[B|S1],S2,continue):- msw(which(F),B), noise(B,S1,S2). contact(K,F1,F2,S1,S2):- msw(moreContact,YN), contact2(K,F1,F2,S1,S2,YN). contact2(,_,_,S,S,stop). contact2([B|K],F1,F2,[B|S1],S2,continue):- msw(which(F1,F2),B), contact(K,F2,B,S1,S2). contactCopy(,S,S). contactCopy([B|K],[B|S1],S2):- contactCopy(K,S1,S2). This is the entire model! Training data: sequence([c,c,g,g,g,t,c,g,c],[a,c,c,g,g,g,t,c,g,c,a,a,t,c,a,a,a,t,c,t,t,t,a,a,c,c,c,g,g,g,t,c,g,c,a,g,a,c,t,a,t,g,t,t,t,a,g,a,a,a,a,c,a,t]). sequence(......,......). sequence(....,.....)
Using a the trained model for prediction ?- viterbig(sequence(K,[t,a,t,a,g,c,g,c,t,a,t,a,g,c,g,c,t,a,t,a])) K = [g,c,g,c] The answer to the query with highest probability.... plus a lot of other facilities
Our first serious application of PRISM: Testing gene finders (MLDM 2007; with C.M.Dahmcke) Problems: Test data expensive; available test data already used for training gene finders; disagreement about what is a gene,... Approach: –Develop and train PRISM model with known, annotated data –Use this to create artificial test data, i.e., sequences with annotations about where-are-the-genes –Check if gene finder programs find the same genes Results: –Three different gene finders found too many and different genes ;-(
Overview of the model (intergenic only) Target predicate: sequence(sequence-of-ACGT, GC-islands, repeats) GC-islands: list of from-no–to-no repeats: list of from-no–to-no with indication of: type: simple, low-complexity, named,... for named: selected from catalogue; which part; forward, backward, transposed, backward+transposed plus one detailed description of »mutation«: [c,c,c,c,i,i,c,c,d,d,c,c,...] (to suppress complexity in the model; for training data generated by a best-match algorithm) repeat1repeat2repeat3... GC-sparse GC-islandGC-sparse
Implemented as a two-layer model Top-level: GC-islands/GC-sparse, length exponential decay Underlying layer: Mix of repeaters and coloured noise Two-level structure implemented by our own abstract datatype: -uses hidden msw’s to control GC-island/sparse -each RV maintained in two versions (hidden) -position, counter to produce annot. GC-islands msw(RV: random-var, value, GC-islands, position)
Lesson learned from gene finder experiment A nontrivial model can be organized in a reasonable way by an experienced logic programmer Preprocessing to freeze one mutation set reduced complexity of learning phase - general technique? Model could be trained in minutes from marked up sequences of total 10 6 letters. With Prolog’s list repr. for sequences we needed 64-bit architechture (sic!) and lo-o-o-o-ot of RAM PRISM is a very flexible tool for combining and varying different models, inventing a little data structure etc., but keeping a model with well-defined semantics We lacked, and would suggest to add to PRISM –Distributions over integers (normal d. or “generic smooth”) –Over-layered, and especially negative criteria (ouf: random variables become dependent)
Anticipated problems for prediction with PRISM and possible solutions Storage consumption, the sequence as array + PRISM’s explanation graphs (???) Execution time –Systematic approach to pruning: Generalize known methods for semantics-preserving program transformations to semantics-approximating transformations –Integrate with existing and efficient software Automatically and hidden??? If, e.g., analysis of PRISM program says “this-looks-like-a-HMM” Clean interfaces ???? –Reduce complexity by splitting the sequence (-- how to integrate this with a nice semantics?)
Biological problems considered Gene finding in health promoting bacteria Phylogenetic gene prediction Prediction of gene function and acquisition by orthology (Gene finding for eukaryotic species)
Project hypotheses summarized Logic-statistic models, a la PRISM or similar, have much higher expressibility and flexibility than traditional models used for sequence analysis (in formal as well as practical sense) If we can solve some of the computational problems involved and learn how to use such powerful modeling tools, there is a potential for new discoveries in biology.
Thanks for your attention! PS. We are seeking (desperately) a good PhD student for the computational issues. Good salary and conditions offered!