1 Learning for Biomedical Information Extraction with ILP
Margherita Berardi, Vincenzo Giuliano, Donato Malerba
Department of Computer Science, University of Bari, Knowledge Acquisition & Machine Learning Lab
CILC 2006, Convegno Italiano di Logica Computazionale, 26-27 June 2006, Dipartimento di Informatica, Bari

2 Outline of the talk
- IE for Biomedicine
- Looking around
- IE problem formulation
  - Which representation model on data?
  - Which features?
  - Which framework for reasoning?
- Mutual Recursion in IE
- Text processing & domain knowledge
- Application to studies on mitochondrial genome
- Conclusions & Future work

4 What is Information Extraction
As a task: filling slots in a database from sub-segments of text.

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."

Richard Stallman, founder of the Free Software Foundation, countered saying…

IE output:
NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..

5 IE from Biomedical Texts: Motivation
- Complexity of biological systems:
  - Too many specialized biological tasks
  - Several entities interacting in a single phenomenon
  - Many conditions to verify simultaneously
- Complexity of biomedical languages:
  - Several nomenclatures, dictionaries, and lexica that tend to become obsolete quickly
- Too much to read!
  - Genome decoding leads to an increasing amount of published literature

6 IE History
- Message Understanding Conference (MUC): DARPA [87-95], TIPSTER [92-96]
- Most early work dominated by hand-built models
  - E.g. SRI's FASTUS, hand-built FSMs
  - But by the 1990s, some machine learning: Lehnert, Cardie, Grishman, and then HMMs: Elkan [Leek 97], BBN [Bikel et al 98]
  - Wrapper induction: initially hand-built, then ML [Soderland 96], [Kushmerick 97], …
- Most learning attempts based on statistical approaches
  - Learning of production rules constrained by probability measures (e.g., HMMs, Probabilistic Context-free Grammars)
- Some recent logic-based approaches
  - Rapier (Califf 98)
  - SRV (Freitag 98)
  - INTHELEX (Ferilli et al. 01)
  - FOIL-based (Aitken 02)
  - Aleph-based (Goadrich et al. 04)

7 Learning Language in biomedicine
- BioCreAtIvE - Critical Assessment for Information Extraction in Biology (http://biocreative.sourceforge.net/)
- BioNLP, natural language processing of biology text (http://www.bionlp.org)
- ACL/COLING Workshops on Natural Language Processing in Biomedicine
- SIGIR Workshops on Text Analysis for Bioinformatics
- Special Interest Group in Text Mining since ISMB03 (Intelligent Systems for Molecular Biology): BioLINK (Biology Literature, Information and Knowledge)
- PSB (Pacific Symposium on Biocomputing) tracks
- Genomic tracks in TREC (Text Retrieval Conference)
- PASCAL challenges on information extraction (http://nlp.shef.ac.uk/pascal/)
- Workshops: IJCAI, ECAI, ECML/PKDD, ICML (Learning Language in Logic since 99, challenge task on Extracting Relations from Biomedical Texts)

8 Is there Logic in language learning?
- Limitations of IE systems, in general:
  - Portability (domain-dependent, task-dependent)
  - Scalability (work well on relevant data)
- Statistics-based approaches: wide coverage, scalability, no semantics, no domain knowledge
- Logic-based approaches: natural encoding of natural language statements and queries in first-order logic, human-comprehensible models, domain knowledge, refinement of models
[R. J. Mooney, Learning for Semantic Interpretation: Scaling Up Without Dumbing Down, ICML Workshop on Language Learning in Logic, 1999]

9 IE problem formulation for HmtDB
HmtDB: a resource of variability data associated with clinical phenotypes, concerning the human mitochondrial genome (http://www.hmdb.uniba.it/)

10 Textual Entity Extraction
Ex: Cytoplasts from two unrelated patients with MELAS (mitochondrial myopathy, encephalopathy, lactic acidosis, and strokelike episodes) harboring an A→G transition at nucleotide position 3243 in the tRNALeu(UUR) gene of the mitochondrial genome were fused with human cells lacking endogenous mitochondrial DNA (mtDNA)
Entities to extract:
- pathology associated with the mutation under study,
- substitution that causes the mutation,
- type of the mutation,
- position in the DNA where the mutation occurs,
- gene correlated to the mutation.
By modelling the sentence structure:
  substitution(X) ← follows(Y,X), type(Y)
Extractors cannot be learned independently!

11 Textual Entity Extraction
- Each entity is characterized by some slots defining a template
- The task is to learn rules to fill slots (template filling)
- Relations in data may allow:
  - intra-template dependencies to be learned
  - context-sensitive application of extractors
Templates: Mutation, Sampled population, DNA sample tissue, DNA screening method, …
Paper sections: Title, Abstract, Introduction, Methods

12 The learning task
- Classification
  - Each class (slot) is a concept (target predicate); each model (template filler) induced for the class is a logical theory explaining the concept (set of predicate definitions)
  - Predefined models of classification should be provided
- Importance of domain knowledge and first-order representations
- Usefulness of mutual recursion (concept dependencies)
ILP = Inductive Learning ∩ Logic Programming
- From IL: inductive reasoning from observations and background knowledge
- From LP: first-order logic as representation formalism

13 ATRE (Apprendimento di Teorie Ricorsive da Esempi, "Learning Recursive Theories from Examples")
http://www.di.uniba.it/~malerba/software/atre/
Given
- a set of concepts $C_1, C_2, \ldots, C_r$
- a set of objects $O$ described in a language $L_O$
- a background knowledge $BK$ described in a language $L_{BK}$
- a language of hypotheses $L_H$ that defines the space of hypotheses $S_H$
- a user's preference criterion $PC$
Find
a (possibly recursive) logical theory $T$ for the concepts $C_1, C_2, \ldots, C_r$, such that $T$ is complete and consistent with respect to the set of observations and satisfies the preference criterion $PC$.

14 ATRE: Main Characteristics
- Learning problem: induce recursive theories from examples
- ILP setting: learning from interpretations
- Observation language: ground multiple-head clauses
- Hypothesis language: non-ground definite clauses
- Constraints: linkedness + range-restrictedness
- Generalization model: generalized implication
- Search strategy for a recursive theory: separate-and-parallel-conquer
- Continuous and discrete attributes and relations
- Background knowledge: intensionally defined

15 …the learning strategy…
Example: parallel search for the predicates even and odd
Seeds: even(0), odd(1)
Simplest consistent clauses are found first, independently of the predicates to be learned

16 …the learning strategy…
Example: parallel search for the predicates even and odd
Seeds: even(2), odd(1)
  even(X) ← succ(Y,X)
  even(X) ← succ(X,Y)
  odd(X) ← succ(Y,X)
  odd(X) ← succ(X,Y)
A predicate dependency is discovered!
  even(X) ← succ(Y,X), succ(Z,Y)
  odd(X) ← succ(Y,X), even(Y)
  odd(X) ← succ(Y,X), zero(Y)
  even(X) ← succ(X,Y), succ(Y,Z)
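The dependency between even and odd can be illustrated with a small forward-chaining sketch. It is not ATRE's search procedure: it only computes the least Herbrand model of a fixed mutually recursive theory over the numbers 0..max_n. The clause odd(X) ← succ(Y,X), even(Y) comes from the slide; the base fact even(0) and the companion clause even(X) ← succ(Y,X), odd(Y) are assumptions added so that the recursion closes, and the function name is hypothetical.

```python
def least_herbrand_model(max_n):
    """Forward-chain a small even/odd theory over the naturals 0..max_n."""
    # succ(Y, X) holds when X is the successor of Y.
    succ = {(n, n + 1) for n in range(max_n)}
    # Assumed base fact: even(0).
    model = {("even", 0)}
    changed = True
    while changed:
        changed = False
        for (y, x) in succ:
            # odd(X) <- succ(Y,X), even(Y)   (clause shown on the slide)
            if ("even", y) in model and ("odd", x) not in model:
                model.add(("odd", x))
                changed = True
            # even(X) <- succ(Y,X), odd(Y)   (assumed companion clause)
            if ("odd", y) in model and ("even", x) not in model:
                model.add(("even", x))
                changed = True
    return model


model = least_herbrand_model(6)
print(sorted(model))
```

Each pass applies both clauses to every succ pair and stops when no new atom is derived, which is the fixpoint that defines the least Herbrand model.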

17 Data preparation
- ATRE's observation language: multiple-head clauses
- Enumeration of positive and negative examples (expert users' manual annotations + unlabelled tokens)
- Descriptions of examples: which features?
  - Statistical (frequencies)
  - Lexical (alphanumeric, capitalized, …)
  - Syntactical (nouns, verbs, adjectives, …)
  - Domain-specific (dictionaries)

19 Text processing
- The GATE (A General Architecture for Text Engineering) framework (http://gate.ac.uk/)
- ANNIE is the IE core:
  - Tokeniser
  - Sentence Splitter
  - POS tagger
  - Morphological Analyser
  - Gazetteers
  - Semantic tagger (JAPE transducer)
  - Orthomatcher (orthographic coreference)
- Some domain-specific gazetteers have been added (diseases, enzymes, genes, methods of analysis)

20 Text processing
- Some regular expressions to capture domain-specific patterns (alphanumeric strings, appositions, etc.)
- Shallow acronym resolution
Screening operations:
- Some POSs (nouns, verbs, adjectives, numbers, symbols)
- Punctuation
- Stopwords (glimpse.cs.arizona.edu)
Stemming (Porter)

21 Text description
- word_to_string(token)
Numerical:
- lenght(token), word_frequency(token), distance_word_category(token1,token2)
Structural:
- s_part_of(token1,token2), first(token), last(token), first_is_char(token), first_is_numeric(token), middle_is_char(token), middle_is_numeric(token), last_is_char(token), last_is_numeric(token), single_char(token), follows(token1,token2)
Lexical:
- type_of(token), type_POS(token)
Domain dependent:
- word_category(token)
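A minimal sketch of how some of the structural descriptors could be computed for a single token. The descriptor names are taken from the slide, but their exact semantics are not given there: the sketch assumes that "middle" means every character strictly between the first and the last, and that "char" means alphabetic. The function describe_token is a hypothetical helper, not part of ATRE or GATE.

```python
def describe_token(token):
    """Compute assumed structural descriptors for a non-empty token."""
    first, last = token[0], token[-1]
    middle = token[1:-1]  # empty for tokens of length 1 or 2
    return {
        "first_is_char": first.isalpha(),
        "first_is_numeric": first.isdigit(),
        # An empty middle yields False for both tests below.
        "middle_is_char": middle.isalpha(),
        "middle_is_numeric": middle.isdigit(),
        "last_is_char": last.isalpha(),
        "last_is_numeric": last.isdigit(),
        "single_char": len(token) == 1,
    }


# Hypothetical mutation-like token: letter, digits, letter.
print(describe_token("A3243G"))
```

Tokens such as "A3243G" are the kind of alphanumeric string for which these descriptors are informative, since the letter/digit layout is visible without consulting any dictionary.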

22 Application
- We considered 71 documents selected by biologists
- Expert users manually annotated occurrences of entities of interest, namely
  - Mutation: position, type, substitution, type_position, locus
  - Subjects: nationality, method, pathology, category, number
- The extraction process (both learning and recognition) is performed locally on text portions of interest, which are automatically classified

23 Textual portions of papers were categorized in five classes: Abstract, Introduction, Materials & Methods, Discussion and Results
The abstract of each paper was processed
[Chart: average number of categories correctly classified]

24 Example description
An A-to-G mutation at nucleotide position (np) 3243 in the mitochondrial tRNALeu(UUR) gene is closely associated with various clinical phenotypes of diabetes mellitus.
[annotation(3)=substitution, annotation(4)=no_tag, annotation(5)=no_tag, annotation(6)=no_tag, annotation(7)=position, annotation(8)=no_tag, annotation(9)=locus, annotation(10)=no_tag, annotation(11)=no_tag, annotation(12)=no_tag, annotation(13)=no_tag, annotation(14)=no_tag, annotation(15)=no_tag, annotation(16)=pathology],
[part_of(1,2)=true, contain(2,3)=true, …, contain(2,16)=true,
 word_to_string(3)='A-to-G', word_to_string(4)='mutation', word_to_string(5)='nucleotid', word_to_string(6)='position', word_to_string(7)='3243', word_to_string(8)='mitochondri', word_to_string(9)='trnaleu(uur)', word_to_string(10)='gene', word_to_string(11)='clos', word_to_string(12)='associat', word_to_string(13)='variou', word_to_string(14)='clinic', word_to_string(15)='phenotyp', word_to_string(16)='diabetes_mellitus',
 type_of(3)=upperinitial, …, type_of(7)=numeric,
 type_POS(3)=jj, type_POS(4)=nn, …, type_POS(15)=nns,
 word_frequency(3)=3, word_frequency(4)=6, …, word_frequency(16)=1,
 word_category(9)=locus, word_category(16)=disease, distance_word_category(9,16)=1,
 follows(3,4)=true, follows(4,5)=true, …, follows(14,15)=true, follows(15,16)=true]).

25 Background knowledge
follows(X,Z) ← follows(X,Y)=true, follows(Y,Z)=true
char_number_char(X)=true ← first_is_char(X)=true, middle_is_numeric(X)=true, last_is_char(X)=true
number_char_char(X)=true ← first_is_numeric(X)=true, middle_is_char(X)=true, last_is_char(X)=true
char_char_number(X)=true ← first_is_char(X)=true, middle_is_char(X)=true, last_is_numeric(X)=true
Domain knowledge:
word_to_string(X)=transition
word_to_string(X)=transversion
word_to_string(X)=substitution
word_to_string(X)=replacement
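The two kinds of background rule can be sketched in Python: the recursive follows rule as a transitive closure over direct adjacency pairs, and char_number_char as a conjunction of three token descriptors. This is only an illustration of what the rules derive, assuming follows facts are given as (earlier, later) pairs and descriptors as a dictionary of booleans; ATRE evaluates such rules intensionally, and the function names here are hypothetical.

```python
def transitive_follows(direct):
    """Close a set of (earlier, later) pairs under follows(X,Z) <- follows(X,Y), follows(Y,Z)."""
    closure = set(direct)
    changed = True
    while changed:
        changed = False
        # Iterate over snapshots so the set can grow inside the loop.
        for (x, y) in list(closure):
            for (y2, z) in list(closure):
                if y == y2 and (x, z) not in closure:
                    closure.add((x, z))
                    changed = True
    return closure


def char_number_char(features):
    """char_number_char(X) <- first_is_char(X), middle_is_numeric(X), last_is_char(X)."""
    return (features["first_is_char"]
            and features["middle_is_numeric"]
            and features["last_is_char"])


closure = transitive_follows({(3, 4), (4, 5), (5, 6)})
print(sorted(closure))
print(char_number_char({"first_is_char": True,
                        "middle_is_numeric": True,
                        "last_is_char": True}))
```

With the closure available, a learned clause can relate a token to any later token in the sentence rather than only to its immediate neighbour.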

26 Experiments
- Mutation template
- 6-fold cross validation
- The user manually annotates 355 tokens (8.65 per abstract)
- About 11% positives
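For readers unfamiliar with the protocol, a k-fold split can be sketched as follows. The slide does not say how the abstracts were assigned to folds, so the round-robin assignment and the function name k_fold_splits are assumptions made for illustration only.

```python
def k_fold_splits(items, k=6):
    """Yield (train, test) pairs; each item appears in exactly one test fold."""
    # Assumed round-robin assignment: item i goes to fold i mod k.
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


# Hypothetical document identifiers, not the real corpus.
docs = ["doc%d" % n for n in range(12)]
for train, test in k_fold_splits(docs, k=6):
    print(len(train), test)
```

In each round the theory would be learned on the training folds and evaluated on the held-out fold, and the per-fold scores averaged.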

27 Experiments

28 Learned theories
annotation(X1)=position ← follows(X2,X1)=true, type_of(X1)=numeric, follows(X1,X3)=true, word_category(X3)=gene, word_to_string(X2)=position.
annotation(X1)=type ← follows(X1,X2)=true, word_frequency(X2) in [8..140], follows(X3,X1)=true, annotation(X3)=substitution
annotation(X1)=position ← follows(X2,X1)=true, annotation(X2)=substitution, follows(X3,X1)=true, follows(X1,X4)=true, word_frequency(X4) in [6..6], annotation(X3)=type, follows(X1,X5)=true, annotation(X5)=locus, word_frequency(X1) in [1..2]
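To make the first learned clause concrete, the sketch below checks it against a tiny hand-made token description. The tokens, their identifiers and their descriptor values are hypothetical and are not taken from the annotated corpus; follows pairs are read as (earlier, later), and the function is_position is an illustrative helper rather than ATRE's own matching procedure.

```python
def is_position(x1, follows, type_of, word_category, word_to_string):
    """Check: annotation(X1)=position <- follows(X2,X1), type_of(X1)=numeric,
    follows(X1,X3), word_category(X3)=gene, word_to_string(X2)=position."""
    if type_of.get(x1) != "numeric":
        return False
    # Some X2 that precedes X1 and whose string is 'position'.
    has_x2 = any(b == x1 and word_to_string.get(a) == "position"
                 for (a, b) in follows)
    # Some X3 that comes after X1 and belongs to the 'gene' category.
    has_x3 = any(a == x1 and word_category.get(b) == "gene"
                 for (a, b) in follows)
    return has_x2 and has_x3


# Hypothetical three-token fragment: 'position' '3243' 'trnaleu(uur)'.
follows = {(1, 2), (2, 3)}
type_of = {2: "numeric"}
word_category = {3: "gene"}
word_to_string = {1: "position", 2: "3243", 3: "trnaleu(uur)"}

print(is_position(2, follows, type_of, word_category, word_to_string))
print(is_position(3, follows, type_of, word_category, word_to_string))
```

The second and third clauses on the slide could not be checked this way in isolation, because their bodies refer to the annotation of neighbouring tokens, which is what makes the learned theory mutually recursive.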

29 Wrap-up
- IE in Biomedicine
- The ILP approach to IE within a multi-relational framework makes it possible to implicitly define
  - Domain knowledge
  - Learning from users' interaction
  - Relational representations
  - Learning relational patterns to allow context-sensitive application of models
- Recursive Theory Learning in IE: ATRE
- Efforts at the text processing level:
  - Ambiguities
  - Data sparseness
  - Noise
- Encouraging results on a real-world data set

30 Where from here?
- Test on available corpora for Bio IE
  - Genia
  - BioCreative
  - NLPBA
  - Genic interaction challenges
- Investigation of semi-supervised approaches: online extension of dictionaries
- How to encapsulate taxonomical knowledge?
- Can information extracted by ATRE really be used as background knowledge for genomic database mining?

31 Notes
- Data sparseness
- Om + di com: the system does not cope with morphosyntactic variety
- Locus and position: word_to_string, specific models

32 Textual Pattern Extraction
Sentence: immortal cells have lost their growth regulatory mechanisms and, thus, continue to divide indefinitely.
abstract(11695244).
contain_vx(11695244,'lose').
contain_nx(11695244,n1).
word(n1,'immort').
word(n1,'cell').
close_to(n1,'immort','cell').
contain_nx(11695244,n2).
word(n2,'growth').
word(n2,'regulatori').
word(n2,'mechan').
close_to(n2,'growth','regulatori').
close_to(n2,'regulatori','mechan').
subject_object(n1,n2).
contain_vx(11695244,'divide').
Goal: to find descriptions of texts belonging to the abstract class
Task relevant objects: nominal chunks, words
Reference object: abstract

33 Language bias
A language bias has been defined in ATRE to allow users to suggest initial models that the learned theory has to satisfy.
Example declarations that can be used to specify language biases:
- starting_number_of_literals(p, N)
- starting_clause(p, [L1,L2,…,LN])
- starting_literal(p, [L1,L2,…,LN])

34 Efficiency issues in ATRE
- Caching the structure of the already explored search space as much as possible:
  - clauses generated during the i-th learning step are saved and reused at the (i+1)-th learning step
  - some pruning and grafting operations are used to adapt previously explored hierarchies of clauses to the current learning step
- Caching for clause evaluation:
  - saves much of the computational effort spent finding the positive and negative examples covered by each generated clause
  - can be applied only to independent clauses, since their positive/negative examples can decrease or remain unchanged (but not increase)

35 The learning strategy…
The basic idea: stepwise construction of a recursive theory $T$
$T_0 = \emptyset, T_1, \ldots, T_i, T_{i+1}, \ldots, T_n = T$
such that:
- $T_{i+1} = T_i \cup \{C\}$ for some clause $C$
- $LHM(T_i) \subseteq LHM(T_{i+1})$, $\forall i \in \{0, 1, \ldots, n-1\}$
- $pos(LHM(T_{i+1})) > pos(LHM(T_i))$ for each $1 \le i \le n$
- $neg(LHM(T_i)) = 0$ for each $1 \le i \le n$

36 …the learning strategy…
1) $pos(LHM(T_{i+1})) > pos(LHM(T_i))$ for each $1 \le i \le n$
   Choose at least one seed for each predicate $p$ to be learned, namely a positive example $e^+$ of $p$ such that $e^+ \notin LHM(T_i)$. Explore the space of clauses more general than $e^+$, looking for $C$ such that $neg(LHM(T_i \cup \{C\})) = 0$
2) $neg(LHM(T_i)) = 0$ for each $1 \le i \le n$
   Select the best consistent clause and apply the layering technique whenever global inconsistency arises
Variation of the classical separate-and-conquer strategy

37 The ILP approach to Data Mining
- Relational representations
- Domain knowledge
- Learning from users' interaction
- Learning relational patterns to allow context-sensitive application of models
ILP = Inductive Learning ∩ Logic Programming
- From IL: inductive reasoning from observations and background knowledge
- From LP: first-order logic as representation formalism

