A Flexible Workbench for Document Analysis and Text Mining
NLDB’2004, Salford, June 23-25 2004
Gulla, Brasethvik and Kaada

1 A Flexible Workbench for Document Analysis and Text Mining
NLDB’2004, Salford, June 23-25 2004
Jon Atle Gulla, Terje Brasethvik and Harald Kaada
Norwegian University of Science and Technology, Norway
Outline:
1. Why a linguistic workbench?
2. How does it work?
3. How to use it?
4. How did we use it?

2 Building Search Engines
Need to handle syntactic and morphological variation in documents:
– language identification, text categorization, stemming/lemmatization, stopwords
Want to modify the query to improve the search result:
– stemming/lemmatization, spell-checking, query reformulation with ontologies/dictionaries, grammatical analysis, phrasing, anti-phrasing
[FAST search engine (www.alltheweb.com)]
[Figure labels: Docs, Index, Retrieve, Query, Modified, Result page]

3 Extracting Information From Text
Structuring knowledge from text:
– tagging, compounds, grammatical analysis, ontological interpretation, regular expressions, pattern recognition
[Deep Thought EU project]
[Figure labels: Text, Database, Ontology, Minimal recursion semantics representations]

4 Constructing Ontologies
Want to extract prominent concepts/relations from text:
– tagging, compounds, NP recognition, term frequencies, stopwords, language identification
[Brasethvik & Gulla, DKE, 38/1, 2001]
[Figure labels: Domain doc. coll., Statistical & linguistic analyses, Manual labor, Ontology]
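A minimal sketch of how term frequencies and a stopword list can surface concept candidates from a collection. The stopword list, the sample documents and the function name `concept_candidates` are illustrative assumptions, not taken from the presentation, and the actual analysis also involves tagging and NP recognition.

```python
import re
from collections import Counter

# Hypothetical, tiny stopword list; a real run would use a full list
# for the language of the collection.
STOPWORDS = {"the", "of", "and", "a", "in", "for", "is", "are", "to"}

def concept_candidates(documents, top_n=3):
    """Rank non-stopword terms by their frequency across all documents."""
    counts = Counter()
    for text in documents:
        tokens = re.findall(r"\w+", text.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts.most_common(top_n)

# Made-up sample documents, for illustration only.
docs = [
    "The patient record is stored in the patient journal.",
    "A journal entry describes the treatment of a patient.",
]
print(concept_candidates(docs, top_n=2))  # [('patient', 3), ('journal', 2)]
```

Frequent terms found this way are only candidates; deciding which ones belong in the ontology remains manual work.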

5 Common Challenges
How to combine linguistic/statistical techniques for document analysis?
– Many combinations feasible
– Not clear what to use under which circumstances
How to support the experimental use of techniques?
– Make use of existing techniques
– Add new ones
– Parameterize techniques
– Run techniques in different orders
Proposal: a simple expandable workbench for planning and running sequences of linguistic/statistical text analysis techniques

6 Workbench Concept
Each technique is a component:
– parameters to govern behavior
– dependencies with other components
Workbench:
– manages components as building blocks
– users can define an analysis as a chain of building blocks
– no programming involved as long as appropriate components are available on the network
[Figure labels: input text, parameters, transform or add, output text]
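The building-block idea can be sketched as plain functions that each take a document plus parameters and return a transformed or extended document. This is an in-process illustration only; the component names, the dictionary layout and `run_chain` are assumptions, not the workbench's actual interface.

```python
def tokenize(doc, lowercase=True):
    """Add a 'tokens' entry to the document; the 'lowercase' parameter governs behavior."""
    text = doc["text"].lower() if lowercase else doc["text"]
    return {**doc, "tokens": text.split()}

def remove_stopwords(doc, stopwords=frozenset()):
    """Transform the 'tokens' entry; depends on tokenize having run first."""
    return {**doc, "tokens": [t for t in doc["tokens"] if t not in stopwords]}

def run_chain(doc, chain):
    """Apply each (component, parameters) pair in the given order."""
    for component, params in chain:
        doc = component(doc, **params)
    return doc

# An analysis is just an ordered list of parameterized components.
chain = [
    (tokenize, {"lowercase": True}),
    (remove_stopwords, {"stopwords": {"the", "of"}}),
]
result = run_chain({"text": "The analysis of the text"}, chain)
print(result["tokens"])  # ['analysis', 'text']
```

Reordering or re-parameterizing the list is enough to try a different analysis, which is the kind of experimentation the workbench is meant to support.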

7 Workbench Concept
Job = input text collection + sequence of parameterized online components
Library of components = components available on the network
Result = XML representation of documents, including all (temporary) results
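One way to picture a job and its XML result is sketched here. The element names (`job`, `document`, `input`, `result`) are invented for illustration and are not the workbench's real output format, and the two steps are local stand-ins for online components.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

@dataclass
class Job:
    documents: list  # input text collection
    steps: list      # ordered (name, function) pairs

    def run(self):
        """Run every step on every document and keep each intermediate result."""
        root = ET.Element("job")
        for i, text in enumerate(self.documents):
            doc_el = ET.SubElement(root, "document", id=str(i))
            ET.SubElement(doc_el, "input").text = text
            current = text
            for name, func in self.steps:
                current = func(current)
                ET.SubElement(doc_el, "result", component=name).text = current
        return root

job = Job(
    documents=["  Kliniske Undersøkelser "],
    steps=[("lowercase", str.lower), ("strip", str.strip)],
)
root = job.run()
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Keeping one `result` element per step means the output of any stage can be inspected after the job has finished.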

8 Workbench Architecture
Components:
– Each component is a web service
– Programmed in any language (Java, Perl, Python, C)
– Add to or transform the input text document(s)
Execution of jobs:
– Workbench keeps track of the techniques that are available and coordinates their execution
– All communication uses XML-RPC
– All temporary files are stored in DOXML format for later inspection
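A minimal sketch of a component exposed over XML-RPC, using only Python's standard library. The `lowercase` method and its signature are hypothetical; the presentation does not specify the interface that real components implement. The server runs in a background thread here only so that the example is self-contained.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer

def lowercase(text):
    """A trivial 'transform' component: returns the input text in lower case."""
    return text.lower()

# Bind to port 0 so the operating system picks a free port.
server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
server.register_function(lowercase, "lowercase")
threading.Thread(target=server.serve_forever, daemon=True).start()

# The coordinating side only needs the component's address.
host, port = server.server_address
with ServerProxy(f"http://{host}:{port}/") as component:
    reply = component.lowercase("Kliniske Undersøkelser")

server.shutdown()
server.server_close()
print(reply)  # kliniske undersøkelser
```

Because the caller sees only an XML-RPC endpoint, the component behind it could equally be written in Java, Perl or C.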

9 The Principle of Adding Information
[Figure: the example phrase "kliniske undersøkelser" (Norwegian for "clinical examinations") annotated by Tagging, Lemmatization and Phrase detection]
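The principle can be illustrated with token records that gain one field per component while everything added earlier is kept. The toy lexicon, the tag names and the adjective-plus-noun phrase rule are assumptions made for this sketch; they are not the components shown on the slide.

```python
# Hypothetical two-word lexicon covering only the example phrase.
LEXICON = {
    "kliniske": {"tag": "ADJ", "lemma": "klinisk"},
    "undersøkelser": {"tag": "NOUN", "lemma": "undersøkelse"},
}

def tag(tokens):
    """Add a part-of-speech tag to every token, keeping existing fields."""
    return [{**t, "tag": LEXICON[t["form"]]["tag"]} for t in tokens]

def lemmatize(tokens):
    """Add a lemma to every token, keeping existing fields."""
    return [{**t, "lemma": LEXICON[t["form"]]["lemma"]} for t in tokens]

def detect_phrases(tokens):
    """Report adjacent ADJ + NOUN pairs as phrases (relies on the tags)."""
    return [
        f'{left["form"]} {right["form"]}'
        for left, right in zip(tokens, tokens[1:])
        if left["tag"] == "ADJ" and right["tag"] == "NOUN"
    ]

tokens = [{"form": w} for w in "kliniske undersøkelser".split()]
tokens = lemmatize(tag(tokens))
phrases = detect_phrases(tokens)
print(tokens[0])  # {'form': 'kliniske', 'tag': 'ADJ', 'lemma': 'klinisk'}
print(phrases)    # ['kliniske undersøkelser']
```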

10 How to Use the Workbench?
Set up the techniques as web services with an XML-RPC interface on some networked computers
Tell the workbench where to find them
Define a job:
– Specify the document(s) to run the job on
– Select components and set parameters
– Decide the order of components
– Run the job
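The steps above can be pictured as two small data structures: a registry that records where each component lives, and a job that lists documents and ordered, parameterized steps. The host names, component names and parameters are placeholders, and in the actual workbench these choices are made through its user interface rather than in code.

```python
# Hypothetical registry: component name -> XML-RPC endpoint (placeholder hosts).
registry = {
    "tagger": "http://host-a.example.org:8001/",
    "lemmatizer": "http://host-b.example.org:8002/",
}

# Hypothetical job definition: documents plus an ordered list of steps.
job = {
    "documents": ["doc1.txt"],
    "steps": [
        {"component": "tagger", "params": {"language": "no"}},
        {"component": "lemmatizer", "params": {}},
    ],
}

def missing_components(job, registry):
    """Return the names of steps whose component has not been registered."""
    return [s["component"] for s in job["steps"] if s["component"] not in registry]

print(missing_components(job, registry))  # []
```

Checking the job against the registry before running it catches the case where a selected component is not available on the network.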

11 Selecting a Component

12 Defining a Job
Iver’s document analysis job consists of 5 techniques

13 How did we use it?
KITH: Norwegian Center of Medical Informatics
– Editorial responsibility for creating and publishing ontologies for medical domains
– Traditional approach: workshops with experts, a manual process
– New approach: generate concept/relation candidates for the health school ontology based on KITH’s document collection on the topic (a 2.79 MB collection of documents)

14 The KITH Ontology Construction Job

15 Extracted Prominent Concepts

16 Extracted Concept Relationships

17 KITH Evaluation
KITH case:
– 10 components used to extract concept candidates from the document collection
– 99 of 111 concepts in KITH’s existing ontology found
– New concepts detected
– Considerably faster than the traditional manual approach
– Workbench results included in KITH’s experimental ontology-driven IR system: www.volven.no
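As a quick check of the coverage figure, the share of existing ontology concepts that the workbench recovered can be computed directly from the two numbers on the slide. Calling this "recall" is a labeling choice made here, not the presenters' wording.

```python
found, existing = 99, 111   # figures reported for the KITH case
recall = found / existing   # share of existing concepts recovered
print(f"{recall:.1%}")      # 89.2%
```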

18 Conclusions
Presented a light-weight and expandable workbench for document analysis and text mining:
– Easy to set up, easy to use
– Limited functionality
Future work:
– Add more components to the library
– Allow more advanced job structures (choices, iterations, etc.)
Thank you!