A Flexible Workbench for Document Analysis and Text Mining. NLDB’2004, Salford, June 23-25, 2004. Gulla, Brasethvik and Kaada.

Presentation transcript:

A Flexible Workbench for Document Analysis and Text Mining
Jon Atle Gulla, Terje Brasethvik and Harald Kaada
Norwegian University of Science and Technology, Norway

Outline:
1. Why a linguistic workbench?
2. How does it work?
3. How to use it?
4. How did we use it?

Building Search Engines
Need to handle syntactic and morphological variation in documents:
– language identification, text categorization, stemming/lemmatization, stopwords
Want to modify the query to improve the search result:
– stemming/lemmatization, spell-checking, query reformulation with ontologies/dictionaries, grammatical analysis, phrasing, anti-phrasing
[FAST search engine]
[Figure labels: Docs, Index, Retrieve, Query, Modified, Result page]

Extracting Information From Text
Structuring knowledge from text:
– tagging, compounds, grammatical analysis, ontological interpretation, regular expressions, pattern recognition
[Figure labels: Text, Database, Ontology, Minimal recursion semantics representations]
[Deep Thought EU project]

Constructing Ontologies
Want to extract prominent concepts/relations from text:
– tagging, compounds, NP recognition, term frequencies, stopwords, language identification
[Brasethvik & Gulla, DKE, 38/1, 2001]
[Figure labels: Domain doc. coll., Statistical & linguistic analyses, Manual labor, Ontology]
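As an illustration of how term frequencies and stopwords can surface concept candidates, here is a minimal Python sketch. It is not the authors' implementation: the function name `concept_candidates`, the sample sentences and the stopword list are all assumptions made for this example, and a real pipeline would add tagging and NP recognition before counting.

```python
import re
from collections import Counter

def concept_candidates(documents, stopwords, top_n=3):
    """Count non-stopword terms across documents and return the most frequent."""
    counts = Counter()
    for doc in documents:
        # Lowercase and split on word characters (a crude tokenizer).
        words = re.findall(r"\w+", doc.lower())
        counts.update(w for w in words if w not in stopwords)
    return counts.most_common(top_n)

# Hypothetical sample collection and stopword list.
documents = [
    "The patient record is stored in the patient journal.",
    "A journal contains the record of a patient.",
]
stopwords = {"the", "is", "in", "a", "of"}

candidates = concept_candidates(documents, stopwords)
```

The frequent terms are only candidates; as the slide notes, manual labor is still needed to turn them into an ontology.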

Common Challenges
How to combine linguistic/statistical techniques for document analysis?
– Many combinations feasible
– Not clear what to use under which circumstances
How to support the experimental use of techniques?
– Make use of existing techniques
– Add new ones
– Parameterize techniques
– Run techniques in different orders
Proposal: a simple expandable workbench for planning and running sequences of linguistic/statistical text analysis techniques

Workbench Concept
Each technique is a component:
– parameters to govern behavior
– dependencies with other components
Workbench:
– manages components as building blocks
– users can define an analysis as a chain of building blocks
– no programming involved as long as appropriate components are available on the network
[Figure labels: input text, parameters, transform or add, output text]

Workbench Concept (continued)
Job = input text collection + sequence of parameterized online components
Library of components = components available on the network
Result = XML representation of documents, including all (temporary) results
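The job definition above can be sketched in a few lines of Python. This is a local, in-process illustration only: the `Job` class, the component signature and the two toy components are assumptions made for this example, whereas the workbench itself runs components over the network and stores results as XML.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Assumed component signature: (text, parameters) -> transformed text.
Component = Callable[[str, Dict[str, object]], str]

def lowercase(text, params):
    """Toy component: lowercase the whole text."""
    return text.lower()

def remove_stopwords(text, params):
    """Toy component: drop the words listed in the 'stopwords' parameter."""
    stop = set(params.get("stopwords", []))
    return " ".join(w for w in text.split() if w not in stop)

@dataclass
class Job:
    documents: List[str]
    steps: List[Tuple[str, Component, dict]] = field(default_factory=list)

    def add(self, name, component, **params):
        """Append a parameterized component to the chain."""
        self.steps.append((name, component, params))
        return self

    def run(self):
        """Run every step in order, keeping each intermediate result."""
        results = []
        for doc in self.documents:
            trace = [("input", doc)]
            current = doc
            for name, component, params in self.steps:
                current = component(current, params)
                trace.append((name, current))
            results.append(trace)
        return results

job = Job(["The Clinical Examinations of the Patient"])
job.add("lowercase", lowercase)
job.add("stopwords", remove_stopwords, stopwords=["the", "of"])
traces = job.run()
```

Keeping the full trace per document mirrors the idea that all temporary results stay available for inspection; reordering the `add` calls is all it takes to try a different sequence.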

Workbench Architecture
Components:
– Each component is a web service
– Programmed in any language (Java, Perl, Python, C)
– Add to or transform input text document(s)
Execution of jobs:
– Workbench keeps track of techniques that are available and coordinates their execution
– All communication with XML-RPC
– All temporary files stored in DOXML format for later inspection
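To make the "component as web service" idea concrete, the following sketch exposes one technique over XML-RPC using Python's standard library and calls it from a client. The method name `tokenize`, its parameter dictionary and the local address are assumptions for this example; the slides do not specify the actual interface that workbench components implement.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer

def tokenize(text, params):
    """Hypothetical component: whitespace tokenizer with an optional lowercase step."""
    if params.get("lowercase", False):
        text = text.lower()
    return text.split()

def start_component(host="127.0.0.1", port=0):
    """Serve the component in a background thread; port 0 picks a free port."""
    server = SimpleXMLRPCServer((host, port), logRequests=False, allow_none=True)
    server.register_function(tokenize, "tokenize")
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server

# Component side: start the service.
server = start_component()
host, port = server.server_address

# Workbench side: call the component by URL, passing text and parameters.
proxy = ServerProxy(f"http://{host}:{port}/")
tokens = proxy.tokenize("Kliniske undersøkelser", {"lowercase": True})

server.shutdown()
server.server_close()
```

Because XML-RPC is language-neutral, the same call would work against a component written in Java, Perl or C, which is the point of the architecture.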

The Principle of Adding Information
Example phrase: kliniske undersøkelser (Norwegian for "clinical examinations")
[Figure labels: Phrase detection, Lemmatization, Tagging]
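One way to picture the principle is that each component annotates the document instead of overwriting it. The sketch below does this on a small XML tree. The element and attribute names (`document`, `token`, `pos`, `lemma`, `phrase`) are invented for this example and are not the DOXML schema, and the lookup tables stand in for real tagging and lemmatization components.

```python
import xml.etree.ElementTree as ET

# Hypothetical lookup tables standing in for real tagger/lemmatizer components.
POS_TAGS = {"kliniske": "adj", "undersøkelser": "noun"}
LEMMAS = {"kliniske": "klinisk", "undersøkelser": "undersøkelse"}

def new_document(text):
    """Wrap each whitespace-separated word in a <token> element."""
    doc = ET.Element("document")
    for word in text.split():
        ET.SubElement(doc, "token").text = word
    return doc

def tag(doc):
    """Add a part-of-speech attribute; the token text is left untouched."""
    for token in doc.iter("token"):
        token.set("pos", POS_TAGS.get(token.text, "unknown"))
    return doc

def lemmatize(doc):
    """Add a lemma attribute next to the original word form."""
    for token in doc.iter("token"):
        token.set("lemma", LEMMAS.get(token.text, token.text))
    return doc

def detect_phrases(doc):
    """Record each adjective+noun pair as an extra <phrase> element."""
    tokens = list(doc.iter("token"))
    for first, second in zip(tokens, tokens[1:]):
        if first.get("pos") == "adj" and second.get("pos") == "noun":
            ET.SubElement(doc, "phrase").text = f"{first.text} {second.text}"
    return doc

doc = detect_phrases(lemmatize(tag(new_document("kliniske undersøkelser"))))
xml_text = ET.tostring(doc, encoding="unicode")
```

After all three steps the original tokens are still present, so a later component (or a human inspecting the file) can use any layer of the analysis.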

How to Use Workbench?
Set up techniques as web services with an XML-RPC interface on some networked computers
Tell the workbench where to find them
Define job:
– Specify document(s) to run job on
– Select components and set parameters
– Decide order of components
– Run job

Selecting a Component

Defining a Job
Iver’s document analysis job consists of 5 techniques

How did we use it?
KITH: Norwegian Center of Medical Informatics
– Editorial responsibility for creating and publishing ontologies for medical domains
– Traditional approach:
  – Workshops with experts
  – Manual process
– New approach:
  – Generate concept/relation candidates for health school ontology based on KITH’s document collection on the topic
  – 2.79 MB collection of documents

The KITH Ontology Construction Job

Extracted Prominent Concepts

Extracted Concept Relationships

KITH Evaluation
KITH case:
– 10 components used to extract concept candidates from document collection
– 99 of 111 concepts in KITH’s existing ontology found
– New concepts detected
– Considerably faster than traditional manual approach
– Workbench results included in KITH’s experimental ontology-driven IR system
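For reference, the reported figure works out as follows. Treating the existing ontology as the reference set is an interpretation added here, not a measure named on the slide.

```latex
\[
\text{coverage} = \frac{99}{111} \approx 0.892 \quad (\text{about } 89\%),
\qquad 111 - 99 = 12 \text{ existing concepts not found.}
\]
```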

Conclusions
Presented a light-weight and expandable workbench for document analysis and text mining:
– Easy to set up, easy to use
– Limited functionality
Future work:
– Add more components to library
– Allow more advanced job structures (choices, iterations, etc.)
Thank you!