A Research Perspective on Text Mining: Tasks, Technologies and Prototype Applications Robert Gaizauskas Natural Language Processing Group Departments of Computer Science, University of Sheffield

September 4, 2002 Euromap Text Mining Seminar Outline of Talk Text Mining: Scenario, Definitions and Brief History Text Mining Tasks + Methodologies Text Mining Technologies Text Mining Prototype Applications Conclusions and Future Directions/Challenges

Text Mining: Scenario

Text Mining Scenario Components: Texts
- Genres: newspapers, company reports, web pages, scientific papers, legal documents
- E-formats: Word documents (.doc, .rtf), PDF/Postscript, HTML/SGML/XML
- Languages: English … French … Greek … Russian … Chinese … Hindi … Sanskrit … Linear B
- Character encodings: ASCII, ISO 8859, Unicode

Text Mining Scenario Components: Users
- User domain of interest
  - Business: competitor intelligence, corporate intranet/memory
  - Scientists: access to literature
  - Military/police intelligence: open source intelligence, intranet
  - Journalists: news archives
- User level of expertise: novice/expert
- User linguistic competence
  - Adult/child
  - Native/non-native language speaker
  - Uni/multi-lingual

Text Mining Scenario Components: Information Access Needs
- Ad hoc searching
  - Specific questions: “What year did the Berlin Wall come down?”
  - General background/context: “Tell me about Zakopane”
- Stable intelligence gathering
  - Scenario-related: “Build a database recording new projects in the energy sector: the players, location, energy type, start date, capitalisation”
  - Entity-related: “Build a database of key scientists in the pharma industry: name, employer, position, start and end dates”
- Current awareness
  - Alerting: “Let me know when any papers are published on the crystallographic structure of any lipase”
  - Document selection: “Assemble articles on drug approvals”
- Summarisation
  - Single/multi-document: “Summarise the Bulger trial”

Text Mining Scenario Components: Tools Information retrieval

What is Information Extraction?
The Information Extraction (IE) task: from each text in a set of natural language texts, extract information about predefined classes of entities and relationships and place this information into a template or database record.
E.g. from financial newswire stories, identify those dealing with management succession events and from these extract details of organisations and persons, the post being assumed or vacated, the reason for vacancy, etc.
IE may also be described as the activity of populating a structured information repository (database) from an unstructured, or free text, information source.

What is Information Extraction? (cont)
The resulting structured database is then used for some other purpose:
- searching or analysis using conventional database queries
- data mining
- generating a summary (perhaps in another language)
- constructing indices into/within/between the source texts

September 4, 2002 Euromap Text Mining Seminar Example: A Wall Street Journal Article wsj94_008.0212 940413-0062. Who's News: @ Burns Fry Ltd. 04/13/94 WALL STREET JOURNAL (J), PAGE B10 MER SECURITIES (SCR) BURNS FRY Ltd. (Toronto) -- Donald Wright, 46 years old, was named executive vice president and director of fixed income at this brokerage firm. Mr. Wright resigned as president of Merrill Lynch Canada Inc., a unit of Merrill Lynch & Co., to succeed Mark Kassirer, 48, who left Burns Fry last month. A Merrill Lynch spokeswoman said it hasn't named a successor to Mr. Wright, who is expected to begin his new position by the end of the month.

September 4, 2002 Euromap Text Mining Seminar Example: A Management Succession Event Template := DOC_NR: "NUMBER" ^ CONTENT: * := ORGANIZATION: ^ POST: "POSITION TITLE" | "no title" ^ IN_AND_OUT: + VACANCY_REASON: {DEPART_WORKFORCE, REASSIGNMENT, NEW_POST_CREATED, OTH_UNK} ^ := PERSON: ^ NEW_STATUS: {IN, IN_ACTING, OUT, OUT_ACTING} ^ ON_THE_JOB: {YES, NO, UNCLEAR} OTHER_ORG: - REL_OTHER_ORG: {SAME_ORG, RELATED_ORG, OUTSIDE_ORG} - := ORG_NAME: "NAME" - ORG_ALIAS: "ALIAS" * ORG_DESCRIPTOR: "DESCRIPTOR" - ORG_TYPE: {GOVERNMENT, COMPANY, OTHER} ^ ORG_LOCALE: LOCALE_STRING {{CITY, PROVINCE, COUNTRY, REGION, UNK} * ORG_COUNTRY: NORMALIZED-COUNTRY-or-REGION | COUNTRY-or-REGION-STRING * := PER_NAME: "NAME" - PER_ALIAS: "ALIAS" * PER_TITLE: "TITLE" *

Example: A (Partially) Filled Management Succession Event Template
:= DOC_NR: "9404130062"
   CONTENT:
:= SUCCESSION_ORG:
   POST: "executive vice president"
   IN_AND_OUT:
   VACANCY_REASON: OTH_UNK
:= IO_PERSON:
   NEW_STATUS: OUT
   ON_THE_JOB: NO
:= IO_PERSON:
   NEW_STATUS: IN
   ON_THE_JOB: NO
   OTHER_ORG:
   REL_OTHER_ORG: OUTSIDE_ORG
:= ORG_NAME: "Burns Fry Ltd."
   ORG_ALIAS: "Burns Fry"
   ORG_DESCRIPTOR: "this brokerage firm"
   ORG_TYPE: COMPANY
   ORG_LOCALE: Toronto CITY
   ORG_COUNTRY: Canada
:= ORG_NAME: "Merrill Lynch Canada Inc."
   ORG_ALIAS: "Merrill Lynch"
   ORG_DESCRIPTOR: "a unit of Merrill Lynch & Co."
   ORG_TYPE: COMPANY
:= PER_NAME: "Mark Kassirer"
:= PER_NAME: "Donald Wright"
   PER_ALIAS: "Wright"
   PER_TITLE: "Mr."

Example: Uses for Templates
From the completely filled version of the preceding template a natural language summary can be generated:
BURNS FRY Ltd. named Donald Wright as executive vice president. Donald Wright resigned as president of Merrill Lynch Canada Inc. Mark Kassirer left as president of BURNS FRY Ltd.
Or, a table can be constructed:
Company               Post          Person         Direction
Burns Fry             Executive VP  Donald Wright  In
Burns Fry             President     Mark Kassirer  Out
Merrill Lynch Canada  President     Donald Wright  Out
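
The two uses above can be illustrated with a minimal sketch. It assumes the filled template has already been flattened into one record per person movement; the `SuccessionFact` class, its field names and the sentence patterns are hypothetical and are not part of the MUC template definition or of any system described in the talk.

```python
from dataclasses import dataclass


@dataclass
class SuccessionFact:
    """One person moving into or out of a post (hypothetical flattened record)."""
    company: str
    post: str
    person: str
    direction: str  # "In" or "Out"


def summarise(fact: SuccessionFact) -> str:
    """Render one fact as a short English sentence using fixed patterns."""
    if fact.direction == "In":
        return f"{fact.company} named {fact.person} as {fact.post}."
    return f"{fact.person} left as {fact.post} of {fact.company}."


def as_table(facts: list[SuccessionFact]) -> str:
    """Lay the facts out as a fixed-width plain-text table."""
    header = ("Company", "Post", "Person", "Direction")
    rows = [header] + [(f.company, f.post, f.person, f.direction) for f in facts]
    widths = [max(len(row[i]) for row in rows) for i in range(4)]
    return "\n".join(
        "  ".join(cell.ljust(w) for cell, w in zip(row, widths)).rstrip()
        for row in rows
    )


# The three rows of the table on the slide.
facts = [
    SuccessionFact("Burns Fry", "Executive VP", "Donald Wright", "In"),
    SuccessionFact("Burns Fry", "President", "Mark Kassirer", "Out"),
    SuccessionFact("Merrill Lynch Canada", "President", "Donald Wright", "Out"),
]

if __name__ == "__main__":
    for fact in facts:
        print(summarise(fact))
    print(as_table(facts))
```

Fixed sentence patterns are the simplest possible generator; they are enough to show that, once the facts are in structured form, summary and table are two renderings of the same records.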

Key Features of Information Extraction
- Texts are unrestricted NL, but typically short
- Template is predefined and fixed
- Information extracted is 'literal' or 'factual'
- The precise definition of the task permits quantitative evaluation of IE systems' performance against human generated results

What IE is NOT: Information Retrieval
The Information Retrieval (IR) task: given a user query and a document collection, retrieve that subset of documents from the collection which are relevant to the user's query.
E.g. given the query "exonuclease gamma-delta resolvase", return those abstracts in PubMed pertaining to these proteins.
Once the IR system returns the documents, the user browses the selected documents in order to fulfil his or her information need.
Depending on the IR system, the user may be further assisted by:
- relevance ranking of retrieved documents
- highlighting of search terms in the text to facilitate identifying passages of particular interest

Strengths and Weaknesses of IR
Strengths:
- Can search huge document collections very rapidly
- Insensitive to genre and domain of the texts
- Can rank documents with respect to likely relevance
- Searches can be iteratively refined
Weaknesses:
- Documents are returned, not information/answers, so the user must further read texts to extract information
- Frequently not discriminating enough (“1563 documents match your request”)

Strengths and Weaknesses of IE
Strengths:
- Extracts facts from texts, not just texts from text collections
- Can feed other powerful applications (databases, indexing engines)
Weaknesses:
- Porting to new genres and domains is time-consuming and requires expertise
- Limited accuracy
- Not fast enough to run over large text collections while the user waits

A Brief History of IE
- The first published work on information extraction (though it was not called this at the time) was in the late 1960s
- A significant precursor was the psychologist Roger Schank’s work on scripts and story understanding in the 1970s
- The 1980s saw the emergence of some commercial systems targeted at financial transactions and newswires
- The big impetus to current research started in the late 1980s when DARPA initiated a series of competitive evaluations of “Message Understanding” systems (the Message Understanding Conferences, MUC)
- MUC ran for 10 years (1987-98) and significantly advanced the field
- Currently there are a number of IE systems on the market and a large and on-going research effort in the field

Outline of Talk Text Mining: A Definition and Brief History Text Mining Tasks + Methodologies Entity Extraction Attribute Extraction Relation Extraction Event Extraction Text Mining Technologies Text Mining Prototype Applications Conclusions and Future Directions/Challenges

IE Component Tasks
- To fill templates, IE researchers have discovered that systems must be able to perform a variety of simpler tasks
- Studying and evaluating these component tasks in isolation has proved a useful way forward for IE
- Component IE tasks which were specified as part of MUC:
  - Named Entity Recognition (persons, organisations, locations, dates)
  - Coreference (multiple references to same entity)
  - Template Elements (organisations, persons, artifacts, locations)
  - Template Relations (employee_of, product_of, location_of)
  - Scenario Template (management succession)

MUC Scoring and Scoring Metrics
- Correct answers, called keys, are produced manually for all the MUC tasks.
- Scoring of system results, called responses, against keys is done automatically.
- At least some portion of the answer keys are multiply produced by different humans so that interannotator agreement figures can be computed.
- Interannotator agreement figures of 95% are sought. Figures of less than 80% are interpreted as meaning the task is not sufficiently clearly defined.
- Principal metrics are:
  - Precision (how much of what your system returns is correct)
  - Recall (how much of what is correct your system returns)
  - F-measure (a weighted combination of precision and recall)
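
The three metrics can be made concrete with a small sketch. It assumes that keys and responses are sets of items compared by exact match, which is a simplification: the MUC scorer also handles partial matches and slot alignment, and none of that is modelled here. The function name and the example entities are illustrative only.

```python
def precision_recall_f(keys: set, responses: set, beta: float = 1.0):
    """Score system responses against answer keys by exact match.

    precision = correct / number of responses
    recall    = correct / number of keys
    F         = (1 + beta^2) * P * R / (beta^2 * P + R)
    With beta = 1 precision and recall are weighted equally.
    """
    correct = len(keys & responses)
    precision = correct / len(responses) if responses else 0.0
    recall = correct / len(keys) if keys else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    f_measure = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_measure


# Toy named-entity example based on the Wall Street Journal article above.
keys = {
    ("PERSON", "Donald Wright"),
    ("PERSON", "Mark Kassirer"),
    ("ORG", "Burns Fry Ltd."),
    ("ORG", "Merrill Lynch Canada Inc."),
}
responses = {
    ("PERSON", "Donald Wright"),
    ("ORG", "Burns Fry Ltd."),
    ("ORG", "Toronto"),  # a spurious response: Toronto is not an organisation
}

if __name__ == "__main__":
    p, r, f = precision_recall_f(keys, responses)
    print(f"precision={p:.3f} recall={r:.3f} F={f:.3f}")
```

In this toy case two of the three responses are correct and two of the four keys are found, so precision is 2/3, recall is 1/2 and the balanced F-measure is 4/7.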

State-of-the-art Evaluation Results (MUC-7)
Task               Recall  Precision  P & R
Named Entity       92      95         93.39
Coreference        56.1    68.8       61.8
Template Element   86      87         86.76
Template Relation  67      86         75.63
Scenario Template  42      65         50.79

Outline of Talk Text Mining: A Definition and Brief History Text Mining Tasks + Methodologies Entity Extraction Attribute Extraction Relation Extraction Event Extraction Text Mining Technologies Text Mining Prototype Applications Conclusions and Future Directions/Challenges

Applying IE to Biological Science Journal Papers
- IE is an appropriate technology when:
  - large volumes of text make human analysis infeasible
  - template-oriented information seeking is appropriate (stable information need, narrow domain)
  - conventional IR is inadequate
  - some error is tolerable
- To date most IE applications are newswire-oriented, with the bulk being in the financial/competitor intelligence area
- Bioinformatics applications provide an interesting challenge to IE:
  - different text types -- journal papers (SGML/PDF), abstracts (BIDS, MEDLINE)
  - different genre -- scientific writing
  - different domain -- biochemistry/molecular biology

EMPathIE: Enzyme and Metabolic Pathways Information Extraction
Aim: Use IE techniques to create a database of enzyme and metabolic pathway data from academic journal papers to support drug discovery
Partners: Depts of Computer Science and Information Studies, U. of Sheffield; Glaxo-Wellcome Research; Elsevier Science
Sponsors: Glaxo-Wellcome Research; Elsevier Science
PostDoc: Dr. Kevin Humphreys
Status: Complete. Project ran 11/97 -- 11/99

EMPathIE: Scenario
- Metabolic processes involve biochemical reactions in which enzymes play key catalytic roles
- Each reaction involves an enzyme and some number of inputs, and results in some number of products
- Sequences of such reactions form metabolic pathways
- Identifying pathways can suggest potential sites for the application of drugs to affect a particular end result
- Reactions are typically reported one per journal paper, so identifying pathways frequently requires combining information from several papers

EMPathIE: Text Sources
- Project focused on 13 journal papers from FEMS Letters (Federation of European Microbiological Societies) and Biochimica et Biophysica Acta from 1992-1995
- Papers supplied by Elsevier Science and marked up according to their proprietary SGML DTD
  - markup reliable for bibliographical and text structure information
  - typographical markup (e.g. italics for gene names) inconsistent and hence ignored

September 4, 2002 Euromap Text Mining Seminar Sample EMPathIE Article Federation of European Microbiological Societies Isocitrate lyase activity in halophilic archaea A. Oren and P. Gurevich, The Hebrew University of Jerusalem Abstract: Eight species of halophilic Archaea were tested for the presence of isocitrate lyase activity. High activities (up to 100 nmol –1 mg protein -1) were detected in Haloferax mediterranei and Haloferax volcanii when grown in medium containing acetate as the principal carbon source. Little activity was found in representatives of the genera Halobacterium and Haloarcula. Isocitrate lyase from Haloferax mediterranei required high potassium chloride concentrations, optimal activity being found at 1.5-3 M potassium chloride and pH 7.0. Replacement of potassium chloride by sodium chloride resulted in much lower activities. Sulfhydryl compounds (cysteine, glutathione) were not stimulatory. In other properties (stimulation by magnesium ions, sensitivity to different inhibitors) the enzyme resembled isocitrate lyases from representatives of the Bacteria and Eucarya. Full Text: …

September 4, 2002 Euromap Text Mining Seminar EMPathIE Template Specification := := NAME: "NAME" + NAME: "NAME" + CODE: “EC_CODE" * INTERACTION: + WEIGHT: "WEIGHT" - SUBUNITS: "SUBUNITS" * := ENZYME: ^ := SOURCE: - NAME: "NAME" + PARTICIPANT: * STRAIN: "STRAIN" * NON_PARTICIPANT: * GENUS: "GENUS" - := := COMPOUND: ^ NAME: "NAME" + TYPE: {SUBSTRATE,PRODUCT, SUPPLIER: "SUPPLIER" * ACTIVATOR, COFACTOR, INHIBITOR,BUFFER} ^ := CONCENTRATION: "CONCENTRATION" - ENZYME: ^ TEMPERATURE: "TEMPERATURE" ORGANISM: ^ ACIDITY: "ACIDITY" -

Filled EMPathIE Template
ENZYME-1
  Name: isocitrate lyase
  E.C. Code: 4.1.3.1
PATHWAY-1
  Name: glyoxylate cycle
  Interaction: INTERACTION-1
ORGANISM-1
  Name: Haloferax volcanii
  Strain: ATCC 29605
  Genus: halophilic Archaea
INTERACTION-1
  Enzyme: ENZYME-1
  Participants: PARTICIPANT-1, PARTICIPANT-2
COMPOUND-1
  Name: glyoxylate phenylhydrazone
PARTICIPANT-1
  Compound: COMPOUND-1
  Type: Product
  Temperature: 35C
COMPOUND-2
  Name: KCl
PARTICIPANT-2
  Compound: COMPOUND-2
  Type: Activator
  Concentration: 1.75 M
SOURCE-1
  Enzyme: ENZYME-1
  Organism: ORGANISM-1

PASTA: Protein Active Site Template Acquisition
Aim: Use IE techniques to create a database of protein active site data from academic journal papers and abstracts to support protein structure analysis
Partners: Depts of Computer Science, Information Studies, Molecular Biology and Biotechnology, U. of Sheffield
Sponsors: BBSRC-EPSRC BioInformatics Programme
PostDoc: Dr. George Demetriou
Status: Complete. Project ran 03/98 -- 03/01

PASTA: Scenario
- Extract information concerning the roles of amino acids in protein molecules and create a database of protein active sites from both scientific journal abstracts and full articles
- New protein structures are being reported at very high rates in the literature

PASTA: Scenario (cont)
- Full evaluation of the results of protein structure comparisons often requires the investigation of extensive literature references
  - E.g. to determine whether an amino acid has been reported as present in a particular region of a protein
- Computational methods that can extract information directly from these articles would be very useful to biologists in comparison classification work and to those engaged in modelling studies

September 4, 2002 Euromap Text Mining Seminar Sample PASTA Article (BIDS Abstract) TI: The crystal structure of a triacylglycerol lipase from Pseudomonas cepacia reveals a highly open conformation in the absence of a bound inhibitor AU: Kim_KK, Song_HK, Shin_DH, Hwang_KY, Suh_SW NA: SEOUL NATL UNIV,COLL NAT SCI,DEPT CHEM,SEOUL 151742,SOUTH KOREA SEOUL NATL UNIV,COLL NAT SCI,DEPT CHEM,SEOUL 151742,SOUTH KOREA JN: STRUCTURE, 1997, Vol.5, No.2, pp.173-185 IS: 0969-2126 AB: Background: … Results: We have determined the crystal structure of a triacylglycerol lipase from Pseudomonas cepacia (Pet) in the absence of a bound inhibitor using X-ray crystallography. The structure shows the lipase to contain an alpha/beta-hydrolase fold and a catalytic triad comprising of residues Ser87, His286 and Asp264. The enzyme shares several structural features with homologous lipases from Pseudomonas glumae (PgL) and Chromobacterium viscosum (CvL), including a calcium-binding site. The present structure of Pet reveals a highly open conformation with a solvent-accessible active site. This is in contrast to the structures of PgL and Pet in which the active site is buried under a closed or partially opened 'lid', respectively. Conclusions: …

September 4, 2002 Euromap Text Mining Seminar (Partially) Filled PASTA Template := DOC_JR: "STRUCTURE, 1997, Vol.5, No.2, pp.173-185" DOC_AUTH: "Kim_KK, Song_HK, Shin_DH, Hwang_KY, Suh_SW" DOC_IS: "0969-2126“ := RES_TYPE: SERINE RES_NO: "87" SITE/FUNCTION: "catalytic","hydrolytic activity", "interfacial activation", "stereoselectivity", "calcium-binding site", ”active-site" SEC_STRUCT: A-HELIX QUATERN_STRUCT: - REGION: 'lid' INTERACTION: - := := PRO_NAME: "Triacylglycerol lipase“ SPE_NAME: "Pseudomonas cepacia" PRO_SCOP_FAM: "Lipase“ SPE_NAME_TYPE: SCIENTIFIC PDB_CODE: 1LGY : = := RESIDUE: <RESIDUE-str-1997-5-2-1 PROTEIN: <PROTEIN-str-1997-5-2-1 PROTEIN: SPECIES:

Outcomes I: The PASTA System
System processes texts in four principal stages:
- text preprocessing: performs text structure analysis and tokenisation
- lexical and terminological processing: performs morphological analysis, multi-token matching against terminology lexicons, and small-scale parsing using terminology grammars
- parsing and semantic interpretation: splits text into sentences, tags tokens with parts-of-speech, performs partial phrasal parsing and compositional semantic interpretation into a predicate-argument “logical form”
- discourse interpretation: integrates each sentence's predicate-argument representation into a hierarchically structured semantic net encoding the system's domain model
A final stage generates template output as required.

PASTA System: Text Preprocessing
Text structure analysis
- Scientific articles typically have a rigid structure, including abstract, introduction, method and materials, results, and discussion sections. Certain sections can be targeted for detailed analysis while others can be skipped completely.
- Where articles are available in SGML with a DTD, an initial module is used to identify particular markup, specified in a configuration file, for use by subsequent modules.
- Where articles are in plain text, an initial 'sectioniser' module is used to identify and classify significant sections using sets of regular expressions.
Tokenisation
- In addition to the normal white-space/punctuation delimited tokenisation required for newswires, scientific papers require further sophistication, e.g. for tokens such as NaCl and Tyr152
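
A rough sketch of both preprocessing steps for plain-text input follows. It is not the PASTA implementation: the heading patterns, the token pattern and the function names are assumptions chosen for illustration, and a real sectioniser would need far more heading variants.

```python
import re

# Hypothetical heading vocabulary; each heading must stand alone on its line.
HEADINGS = {
    "abstract": r"abstract",
    "introduction": r"introduction",
    "methods": r"(?:materials?\s+and\s+methods?|methods?\s+and\s+materials?)",
    "results": r"results",
    "discussion": r"discussion",
}
SECTION_PATTERNS = {
    name: re.compile(rf"^\s*(?:\d+\.?\s*)?{body}\s*:?\s*$", re.IGNORECASE)
    for name, body in HEADINGS.items()
}


def sectionise(text: str) -> dict:
    """Group the lines of a plain-text article under the last heading seen."""
    sections = {}
    current = "front_matter"
    for line in text.splitlines():
        for name, pattern in SECTION_PATTERNS.items():
            if pattern.match(line):
                current = name  # heading line: switch section, do not store it
                break
        else:
            sections.setdefault(current, []).append(line)
    return sections


# Keep letter+digit strings (Tyr152) and mixed-case formulae (NaCl) whole,
# keep decimal numbers whole, and emit any other non-space character alone.
TOKEN = re.compile(r"[A-Za-z]+\d+|[A-Za-z]+|\d+(?:\.\d+)?|[^\sA-Za-z\d]")

# The 20 standard three-letter amino acid codes followed by a residue number.
RESIDUE = re.compile(
    r"^(Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|Leu|Lys|Met|Phe|Pro|Ser|Thr|"
    r"Trp|Tyr|Val)(\d+)$"
)


def tokenise(text: str) -> list:
    """Split text into tokens using the TOKEN pattern."""
    return TOKEN.findall(text)


def split_residue(token: str):
    """Return (residue type, residue number) for tokens like Tyr152, else None."""
    match = RESIDUE.match(token)
    return (match.group(1), int(match.group(2))) if match else None


if __name__ == "__main__":
    print(tokenise("optimal activity at 1.5-3 M NaCl near Tyr152."))
    print(split_residue("Tyr152"))
```

Keeping `Tyr152` as one token and splitting it in a later step is only one possible design; the point is that purely punctuation-based tokenisation gives no handle on such domain-specific forms.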

PASTA System: Lexical and Terminological Processing
The main information sources used for terminology identification in the biochemical domain are:
- case-insensitive terminology lexicons (at present approximately 25,000 component terms in 52 categories -- see next slide)
- morphological cues, mainly standard biochemical suffixes
- hand-constructed grammar rules for each terminology class
For example, the enzyme name mannitol-1-phosphate 5-dehydrogenase would be recognised:
1. by the classification of mannitol as a potential compound modifier and phosphate as a compound, both matched in the terminology lexicon
2. by morphological analysis suggesting dehydrogenase as a potential enzyme head, due to its suffix -ase
3. by domain-specific grammar rules combining the enzyme head with a known compound and modifier which can play the role of enzyme modifier
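
The three steps of the example can be sketched as follows. This is a toy illustration only: the two-entry lexicon, the category labels, the `locant` class for position numbers and the single grammar rule are all assumptions, whereas the real system uses large lexicons and hand-built grammars for each term class.

```python
import re

# Toy case-insensitive lexicon (hypothetical entries and category names).
LEXICON = {
    "mannitol": "compound_modifier",
    "phosphate": "compound",
}


def classify(token: str) -> str:
    """Classify one component by lexicon lookup, then by suffix cue."""
    lowered = token.lower()
    if lowered in LEXICON:
        return LEXICON[lowered]
    # Morphological cue: the suffix -ase suggests an enzyme head.
    # Suffix cues overgenerate (e.g. "base"), hence the grammar rule below.
    if len(lowered) > 3 and lowered.endswith("ase"):
        return "enzyme_head"
    if lowered.isdigit():
        return "locant"  # position number inside a chemical name
    return "other"


def tag_enzyme_name(name: str):
    """Tag each component and apply one toy grammar rule.

    Rule (an assumption for illustration): an enzyme name is one or more
    modifiers, compounds or locants, including at least one known compound,
    followed by an enzyme head.
    """
    parts = [p for p in re.split(r"[-\s]+", name) if p]
    tags = [classify(p) for p in parts]
    allowed = {"compound_modifier", "compound", "locant"}
    is_enzyme = (
        len(tags) >= 2
        and tags[-1] == "enzyme_head"
        and "compound" in tags[:-1]
        and all(t in allowed for t in tags[:-1])
    )
    return list(zip(parts, tags)), is_enzyme


if __name__ == "__main__":
    tagged, ok = tag_enzyme_name("mannitol-1-phosphate 5-dehydrogenase")
    print(tagged, ok)
```

Requiring a known compound before the head is what lets the rule accept the example while rejecting a bare suffix match on its own.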

Biochemical Terminological Lists
Principal Term Classes
- protein names (trypsin, lipase, etc.)
- amino acids (Glycine, Phe, etc.)
- gene names
- species (human, E.coli, etc.)
- secondary structure (alpha helix, beta sheet, etc.)
- supersecondary structure (coiled-coil alpha helix, etc.)
- quaternary structure (dimer, hexamer, etc.)
- regions (carboxy-terminal) and sites (glycosylation site, etc.)
- chains (butyl chain, catalytic chain, etc.)
- interactions (hydrogen bonds, contacts)
- bases (DNA, RNA)
- elements (N, Ca, NZ, etc.)
- non-protein entities (cofactors, substrates, etc.)
- measure terms (kcal, millimeter, joule, etc.)
Principal Terminology Resources
- Protein Data Bank
- Enzyme classification
- SCOP classification
- CATH classification
- IUPAC / IUBMB Nomenclature Recommendations

PASTA System: Parsing and Semantic Interpretation
- The syntactic processing modules treat any terms recognised in the previous stage as non-decomposable units, with a syntactic role of proper noun. As a consequence:
  - The sentence splitting module cannot propose sentence boundaries within a preclassified term.
  - The part-of-speech tagger only attempts to assign tags to tokens which are not part of proposed terms.
  - The phrasal parser treats terms as preparsed noun phrases.
- Parsing is carried out with a general phrasal (feature-based unification) grammar of English.
- The phrasal grammar includes compositional semantic rules, which are used to construct a semantic representation of the “best”, possibly partial, parse of each sentence.
- This predicate logic-like representation is passed on as input to the discourse interpretation stage.

PASTA System: Parsing and Semantic Interpretation (cont)
Example sentence: “This cleft contains the putative catalytic residue Glu132 above the core of the beta-barrel.”
Syntactic Analysis: [Figure: parse tree of the example sentence, with nodes S, NP, VP, PP, Det, N, V, P]
Semantic Analysis:
contain(e1), cleft(e2), lsubj(e1,e2), det(e2,this),
residue(e3), lobj(e1,e3), name(e3,"Glu132"), adj(e3,putative), adj(e3,catalytic),
core(e4), above(e1,e4),
secondary_structure(e5), name(e5,"beta-barrel"), of(e4,e5)

PASTA System: Discourse Interpretation
- The semantic representation of each sentence is added to a predefined domain model made up of an ontology, or concept hierarchy, and inheritable attributes and inference rules associated with concept nodes in the hierarchy.
- The domain model is gradually populated with instances of concepts from the text to become a discourse model.
- A powerful coreference mechanism attempts to merge each newly introduced instance with an existing one, subject to various syntactic and semantic constraints.
- Inference rules of particular instance types may then fire to hypothesise the existence of instances required to fill a template (e.g. an organism with a source_of relation to an enzyme).
- The coreference mechanism will then attempt to resolve the hypothesised instances with actual instances from the text, making up for deficiencies in parsing.

PASTA System: Discourse Interpretation (cont)
1. The three-dimensional structure of Endo H has been determined …
2. A shallow curved cleft runs across the surface of the molecule from …
3. This cleft contains the putative catalytic residues Asp130 and Glu132 …
- From 1, Endo H is identified as a protein, protein(e1), name(e1,"Endo H"), and added to the discourse model.
- From 2, the cleft is identified, cleft(e23), and the molecule, molecule(e25).
- The ontology records that proteins are molecules, and coreference resolves e25 and e1.
- The domain model/ontology records that clefts are regions and that regions are located in proteins, so a protein, say e42, is hypothesised together with the relation located_in(e23,e42).
- In the absence of full semantic analysis of “runs across the surface of”, coreference picks the closest protein and resolves e42 with e1/e25, i.e. the cleft is assumed to be in Endo H.
- From 3, the analysis is as before: the cleft is identified as, say, e52, and the residue as e61.
- Coreference resolves the cleft e52 with the preceding e23.
- The domain model allows reasoning from “contains” to establish the relation located_in(e61,e23): the residue is located in the cleft.
- Transitivity of located_in permits the conclusion located_in(e61,e1): Glu132 is in Endo H.
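
The chain of reasoning in this example can be reproduced with a very small model. The sketch below is an assumption-laden stand-in for the discourse interpreter: it represents coreference as merging entity identifiers with a union-find structure and answers located_in queries by following relations transitively. The class and method names are invented for illustration, and the syntactic and semantic constraints that decide when coreference is allowed are not modelled.

```python
class DiscourseModel:
    """Toy discourse model: coreference classes plus a located_in relation."""

    def __init__(self):
        self.parent = {}         # union-find forest over entity identifiers
        self.located_in = set()  # asserted (inner, outer) pairs

    def find(self, entity: str) -> str:
        """Return the representative of the entity's coreference class."""
        self.parent.setdefault(entity, entity)
        while self.parent[entity] != entity:
            # Path halving keeps the forest shallow.
            self.parent[entity] = self.parent[self.parent[entity]]
            entity = self.parent[entity]
        return entity

    def corefer(self, a: str, b: str) -> None:
        """Record that a and b denote the same instance."""
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a

    def add_located_in(self, inner: str, outer: str) -> None:
        self.located_in.add((inner, outer))

    def is_located_in(self, inner: str, outer: str) -> bool:
        """True if located_in(inner, outer) follows by coreference and transitivity."""
        edges = {}
        for a, b in self.located_in:
            edges.setdefault(self.find(a), set()).add(self.find(b))
        target = self.find(outer)
        stack, seen = [self.find(inner)], set()
        while stack:
            node = stack.pop()
            for nxt in edges.get(node, ()):
                if nxt == target:
                    return True
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return False


def build_endo_h_model() -> DiscourseModel:
    """Replay the steps of the Endo H example with the slide's identifiers."""
    model = DiscourseModel()
    model.corefer("e1", "e25")          # the molecule is Endo H
    model.add_located_in("e23", "e42")  # cleft is in a hypothesised protein
    model.corefer("e1", "e42")          # closest protein is Endo H
    model.corefer("e23", "e52")         # "this cleft" is the earlier cleft
    model.add_located_in("e61", "e52")  # residue is in the cleft ("contains")
    return model


if __name__ == "__main__":
    print(build_endo_h_model().is_located_in("e61", "e1"))
```

Storing relations on the original identifiers and resolving them to class representatives only at query time means the order in which coreference links and relations arrive does not matter.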

Outcomes II: Text Corpora
- 1500 BIDS abstracts
  - from 24 molecular biology journals from 1994-98
  - ASCII text
  - ~250 words each
  - structured keyword fields in header
- 300 full journal papers
  - from Molecular Biology and Structure from 1994-1998
  - from publishers' websites (HTML/ASCII)

Annotated Corpora
Annotated corpora are needed for system development and evaluation.
- For development, PASTA researchers at Sheffield manually prepared:
  - 52 terminology-tagged journal article abstracts for the term classes protein, species, residue, site, region, secondary structure, supersecondary structure, quaternary structure, chain, base, atom, non-protein, interactions (1376 term occurrences)
  - filled templates derived from 25 abstracts, used for training
- For final blind evaluation, independent domain experts prepared:
  - 62 terminology-tagged abstracts for the term classes; 20 texts were annotated by both annotators; interannotator agreement is low (as assessed by the MUC scorer)
  - filled templates from 30 abstracts, 10 annotated by both annotators

Evaluation
System performance is measured against manually annotated corpora using the automatic scorer developed in the DARPA MUC evaluations.
- On development texts, terminology evaluation results: Recall: 88% Precision: 94% P & R: 91%
- In final blind evaluation, terminology evaluation results: Recall: 82% Precision: 84% P & R: 83%
- Template filling evaluation results: Recall: 68% Precision: 71% P & R: 69%

Outcomes III: Browser-based Interface
- Raw templates or texts annotated with identifiers for protein and residue names are not of much use to the working biologist
- The most effective delivery platform is a Web browser
- Therefore we have designed and implemented a browser-based interface to allow a user to browse the results
- This has the added benefit that links to source texts can easily be added, which can help to overcome the IE system’s errors

Outcomes III: Browser-based Interface

Outcomes IV: Active PASTA – The PASTA Daemon
PASTA is being integrated into a web-linked system that automatically, on a daily basis:
- retrieves texts related to protein structure from Medline
- runs the text through PASTA to extract protein/residue/active site information
- integrates the extracted information into previously extracted tables/indices
- publishes the results via the PASTA browser-interface on a web-server
The result will be a web site accessible by molecular biologists with PASTA-extracted information plus links back to Medline for confirmation/refutation.

E-Science: MyGrid
MyGrid is an EPSRC-funded E-Science project involving:
- University of Manchester (Computer Science)
- EBI – Hinxton
- University of Southampton (Computer Science)
- University of Newcastle (Computer Science)
- University of Nottingham (Computer Science)
- University of Sheffield (Computer Science)
Aim: To build a virtual workbench to support the E-Biologist in performing in silico experiments involving transparent access to distributed:
- Structured data resources (e.g. Swissprot, PDB)
- Textual data resources (e.g. Medline, on-line journals)
- Algorithms (e.g. Blast)
- Processing resources

E-Science: MyGrid (cont)
Sheffield will provide text-mining technology (EMPathIE, PASTA)
Current activities:
- Integrating UMLS into terminology processing components
- Integrating the Gene Ontology into the PASTA discourse model (DAML+OIL)
- Acquiring Medline locally for terminology mining and indexing experiments
- Making aspects of PASTA available as a Web Service (via SOAP)

Conclusions + Future Work
- EMPathIE and PASTA demonstrate the challenges encountered and the benefits gained by applying IE techniques to new areas
  - terminology is particularly critical/difficult in this area
- Evaluation scores are not as high as for MUC-7, but
  - tasks are harder
  - training resources are much more limited
- Future work includes:
  - improved techniques for handling terminological variants
  - improved techniques to produce IE system resources automatically or semi-automatically: terminology lists, grammars, domain models/ontologies
  - richer domain modelling

THE END
Target extracted templates and relations:
- Structure equivalence (between different molecules)
- 3-D relationship (within one structure)
PROTEIN: ENR  RESIDUE TYPE: Lys  RESIDUE NO: 206  2NDRY STRUC: alpha-5
PROTEIN: HSD  RESIDUE TYPE: Tyr  RESIDUE NO: 152
PROTEIN: ENR  RESIDUE TYPE: Tyr  RESIDUE NO: 198  FUNCTION: base in the ENR reaction mechanism
PROTEIN: HSD  RESIDUE TYPE: NADH  FUNCTION: cofactor
PROTEIN: ENR  RESIDUE TYPE: NADH  FUNCTION: cofactor
PROTEIN: ENR  RESIDUE TYPE: Met  RESIDUE NO: 202
PROTEIN: HSD  RESIDUE TYPE: Lys  RESIDUE NO: 156  FUNCTION: putative catalytic
“Tyr 152 phenolic O 4A from C4 of nicotinamide in HSD”
“Tyr 198 phenolic O 6A from C4 of nicotinamide in ENR”
