JRC 2005/05/10 Automatic event extraction from text on the base of linguistic and semantic annotation Thierry Declerck DFKI – Language Technology Lab.

Presentation transcript:

Events …
– Involve entities and relations between them
– Imply a change of state
– Example: The striker of Liverpool shot a wonderful goal in the 87th minute.
  1 event (goal-shot), 2 entities (person and team), 1 change of state (the scoring)
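
As a rough illustration of the slide's example, the event, its entities and its change of state could be held in a small record. This is only a sketch: the class and field names (Entity, Event, state_change, minute) are assumptions made for illustration and do not come from the talk.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Entity:
    kind: str  # e.g. "person" or "team"
    text: str  # surface string taken from the sentence


@dataclass
class Event:
    kind: str                                  # e.g. "goal-shot"
    entities: List[Entity] = field(default_factory=list)
    state_change: str = ""                     # what changes when the event happens
    minute: Optional[int] = None               # time stamp, if the text gives one


# The example sentence from the slide, encoded by hand (hypothetical encoding).
goal = Event(
    kind="goal-shot",
    entities=[
        Entity("person", "The striker of Liverpool"),
        Entity("team", "Liverpool"),
    ],
    state_change="the scoring",
    minute=87,
)
```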

Events in textual documents
Various types of text:
– Structured (Example and Example_2): processing requires pattern matching techniques; very little linguistic knowledge is needed.
– Semi-structured (Example): requires a mixture of pattern matching and more linguistic knowledge.
– Unstructured (Example): requires a mixture of layout analysis and linguistic knowledge.
All types of text require a domain-specific knowledge base (ontology) for event extraction.

Domain Knowledge
Domain knowledge can be organised in terminologies, thesauri, taxonomies or ontologies. Example of a (non-formal) multilingual ontology for the soccer domain. More on ontology engineering in the talk by Borislav.

Automatic Event Extraction from Text is
– A combination of human language technology (HLT) and semantic web technologies (ontologies).
– It can also be done by purely statistical means (with minimal linguistic knowledge), but we concentrate here on the HLT-based approach.

What is Human Language Technology?

Linguistic Analysis
Language technology tools are needed to support the upgrade of the current web to the Semantic Web (SW) by providing an automatic analysis of the linguistic structure of textual documents. Free-text documents that undergo linguistic analysis become available as semi-structured documents, from which meaningful units can be extracted automatically (information extraction) and organized through clustering or classification (text mining). Here we focus on the following linguistic analysis steps that underlie the extraction tasks: tokenization, morphological analysis, part-of-speech tagging, chunking, dependency structure analysis, semantic tagging.

Tokenisation
Tokenisation deals with the detection of the word units in a text and with the detection of sentence boundaries.
Example: The markets acknowledge the measures taken on the 24th of September by the CEO of XYZ Corp.
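
The two tasks named on this slide can be sketched with a regular expression and a small abbreviation list. This is a minimal sketch, not the tokeniser used in the talk: the abbreviation list and the function names are assumptions, and a sentence that ends in an abbreviation (as the slide's example does with "Corp.") is exactly the case this naive splitter cannot decide.

```python
import re

# Hypothetical, illustrative list; a real system would use a larger resource.
ABBREVIATIONS = {"Corp.", "Inc.", "Dr."}

# A word (optionally hyphenated, optionally followed by a period) or one symbol.
TOKEN_RE = re.compile(r"\w+(?:-\w+)*\.?|[^\w\s]")


def tokenise(text):
    """Split text into word units, detaching periods that are not part of
    a known abbreviation."""
    tokens = []
    for raw in TOKEN_RE.findall(text):
        if len(raw) > 1 and raw.endswith(".") and raw not in ABBREVIATIONS:
            tokens.extend([raw[:-1], "."])
        else:
            tokens.append(raw)
    return tokens


def split_sentences(tokens):
    """Close a sentence after every free-standing '.', '!' or '?' token."""
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if token in {".", "!", "?"}:
            sentences.append(current)
            current = []
    if current:  # trailing material without final punctuation
        sentences.append(current)
    return sentences


tokens = tokenise("The CEO of XYZ Corp. spoke. Markets rose.")
sentences = split_sentences(tokens)
```

Here "Corp." stays a single token, so the period inside it does not trigger a sentence break, while the periods after "spoke" and "rose" do.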

Morphological Analysis
Morphological analysis is concerned with the inflectional, derivational, and compounding processes in word formation, in order to determine properties such as stem and inflectional information. Together with part-of-speech (PoS) information, this process delivers the morpho-syntactic properties of a word.
For the German word Häusern (houses), the analysis should yield the following morphological information:
[PoS=N NUM=PL CASE=DAT GEN=NEUT STEM=HAUS]
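
Downstream steps usually want such an annotation as a feature map rather than a string. The following sketch only parses the bracket notation shown on the slide; it does not perform morphological analysis itself, and the function name is an assumption.

```python
def parse_features(annotation):
    """Turn '[PoS=N NUM=PL ...]' into a dict of feature name -> value."""
    body = annotation.strip().strip("[]")
    return dict(pair.split("=", 1) for pair in body.split())


haeusern = parse_features("[PoS=N NUM=PL CASE=DAT GEN=NEUT STEM=HAUS]")
```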

Part-of-Speech Tagging
Part-of-Speech (PoS) tagging is the process of determining the correct syntactic class (a part of speech, e.g. noun, verb, etc.) for a particular word given its current context. The word works in the following sentences can be either a verb or a noun:
He works [N,V] the whole day for nothing.
His works [N,V] have all been sold abroad.
PoS tagging involves disambiguation between multiple part-of-speech tags, as well as guessing the correct part-of-speech tag for unknown words on the basis of context information.
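
The disambiguation of works can be illustrated with a toy tagger that looks only at the tag of the preceding word. This is a deliberately simplified sketch: the lexicon, the tag names and the two context rules are assumptions, and unknown words are simply guessed to be nouns instead of being guessed from context.

```python
# Hypothetical mini-lexicon: each word form lists its possible tags.
LEXICON = {
    "he": ["PRON"],
    "his": ["POSS"],
    "the": ["DET"],
    "works": ["N", "V"],
}


def tag(tokens):
    """Assign one tag per token, using the previous tag to resolve ambiguity."""
    tags = []
    for token in tokens:
        candidates = LEXICON.get(token.lower())
        if candidates is None:
            tags.append("N")  # crude default for unknown words
        elif len(candidates) == 1:
            tags.append(candidates[0])
        else:
            previous = tags[-1] if tags else None
            if previous in ("POSS", "DET") and "N" in candidates:
                tags.append("N")  # "His works ..." is a noun reading
            elif previous == "PRON" and "V" in candidates:
                tags.append("V")  # "He works ..." is a verb reading
            else:
                tags.append(candidates[0])
    return tags


verb_reading = tag(["He", "works"])
noun_reading = tag(["His", "works"])
```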

Chunking
Chunks are sequences of words that are grouped on the basis of linguistic properties, such as nominal, prepositional, adjectival and adverbial phrases and verb groups.
[NP His works] [VG have] [NP all] [VG been sold] [AdvP abroad].
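
One simple way to obtain such a bracketing is to map every PoS tag to a chunk type and to merge adjacent words of the same type. The sketch below reproduces the slide's example under that assumption; the tag set and the tag-to-chunk table are hypothetical, and real chunkers use richer grammars.

```python
from itertools import groupby

# Hypothetical mapping from PoS tag to chunk type.
CHUNK_OF_TAG = {
    "POSS": "NP", "DET": "NP", "ADJ": "NP", "N": "NP", "PRON": "NP",
    "AUX": "VG", "V": "VG",
    "ADV": "AdvP",
}


def chunk(tagged):
    """Group consecutive (word, tag) pairs whose tags map to the same chunk type."""
    chunks = []
    for label, group in groupby(tagged, key=lambda pair: CHUNK_OF_TAG.get(pair[1], "O")):
        chunks.append((label, [word for word, _ in group]))
    return chunks


def render(chunks):
    """Print chunks in the bracket notation used on the slide."""
    return " ".join(f"[{label} {' '.join(words)}]" for label, words in chunks)


tagged = [
    ("His", "POSS"), ("works", "N"), ("have", "AUX"), ("all", "PRON"),
    ("been", "AUX"), ("sold", "V"), ("abroad", "ADV"),
]
bracketed = render(chunk(tagged))
```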

Named Entity Detection
Related to chunking is the recognition of so-called named entities (names of institutions and companies, date expressions, etc.). The extraction of named entities is mostly based on a strategy that combines look-up in gazetteers (lists of companies, cities, etc.) with the definition of regular expression patterns. Named entity recognition can be included as part of the linguistic chunking procedure. The sentence fragment
…the secretary-general of the United Nations, Kofi Annan,…
will then be annotated as a nominal phrase that includes two named entities: United Nations with named entity class organization, and Kofi Annan with named entity class person.
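
The combination of gazetteer look-up and regular expression patterns can be sketched as follows. The gazetteer content, the date pattern and the function name are assumptions chosen to cover the examples in this transcript, not resources from the talk.

```python
import re

# Hypothetical gazetteer: known names and their named entity class.
GAZETTEER = {
    "United Nations": "organization",
    "Kofi Annan": "person",
}

# Illustrative pattern for date expressions such as "24th of September".
DATE_RE = re.compile(
    r"\b\d{1,2}(?:st|nd|rd|th)? of "
    r"(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December)\b"
)


def find_entities(text):
    """Return (entity text, class) pairs in order of appearance."""
    found = []
    for name, entity_class in GAZETTEER.items():
        for match in re.finditer(re.escape(name), text):
            found.append((match.start(), match.group(), entity_class))
    for match in DATE_RE.finditer(text):
        found.append((match.start(), match.group(), "date"))
    return [(surface, entity_class) for _, surface, entity_class in sorted(found)]


entities = find_entities("the secretary-general of the United Nations, Kofi Annan,")
```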

Dependency Structure Analysis
A dependency structure consists of two or more linguistic units that immediately dominate each other in a syntax tree. The detection of such structures is generally not provided by chunking but builds on top of it. Two main types of dependencies are relevant for our purposes: on the one hand, the internal dependency structure of phrasal units or chunks, and on the other hand, the so-called grammatical functions (like subject and direct object).

Internal Dependency Structure
To describe the internal dependency structure, linguistic analysis uses the terms head, complements and modifiers: the head is the dominating node in the syntax tree of a phrase (chunk), complements are necessary qualifiers thereof, and modifiers are optional qualifiers. Consider the following example:
The shot by Christian Ziege goes over the goal.
The prepositional phrase by Christian Ziege (containing the named entity Christian Ziege) depends on (and modifies) the head noun shot.

Grammatical Functions
Grammatical functions determine the role (function) of each of the linguistic chunks in the sentence and make it possible to identify the actors involved in certain events. For example, in the following sentence the syntactic (and also the semantic) subject is the NP constituent The shot by Christian Ziege:
The shot by Christian Ziege goes over the goal.
This nominal phrase depends on (and complements) the verb goes, whereas the noun shot is the head of the NP (it is the shot that goes over the goal, and not Christian Ziege!).
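
The analysis of the example sentence can be written down as a list of dependency arcs, from which the subject's head word and its full phrase are read off. This is a hand-built sketch: the arc list, the relation labels (det, name, complement, modifier, subject, root) and the function names are assumptions made for illustration, not the output of a parser.

```python
# (index, word, index of head, relation); head 0 marks the root of the sentence.
ARCS = [
    (1, "The", 2, "det"),
    (2, "shot", 6, "subject"),
    (3, "by", 2, "modifier"),
    (4, "Christian", 5, "name"),
    (5, "Ziege", 3, "complement"),
    (6, "goes", 0, "root"),
    (7, "over", 6, "modifier"),
    (8, "the", 9, "det"),
    (9, "goal", 7, "complement"),
]


def subtree(arcs, head_index):
    """Collect the words that depend, directly or indirectly, on head_index."""
    members = {head_index}
    changed = True
    while changed:
        changed = False
        for index, _, head, _ in arcs:
            if head in members and index not in members:
                members.add(index)
                changed = True
    return " ".join(word for index, word, _, _ in arcs if index in members)


def subject(arcs):
    """Return (head word, full phrase) of the subject, or None if there is none."""
    for index, word, _, relation in arcs:
        if relation == "subject":
            return word, subtree(arcs, index)
    return None


head_word, phrase = subject(ARCS)
```

The head word is shot, not Christian Ziege, which is the distinction an event extractor needs in order to decide what went over the goal.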

Semantic Tagging
Automatic semantic annotation has developed within language technology in recent years in connection with more integrated tasks like information extraction, which require a certain level of semantic analysis. Semantic tagging consists of annotating each content word in a document with a semantic category. Semantic categories are assigned on the basis of a semantic resource like WordNet for English, or EuroWordNet, which links words between many European languages through a common inter-lingua of concepts.

Semantic Resources
Semantic resources are captured in dictionaries, thesauri, and semantic networks, all of which express, either implicitly or explicitly, an ontology of the world in general or of more specific domains, such as medicine. They can be roughly divided into the following three groups:
– Thesauri: semantic resources that group together similar words or terms according to a standard set of relations, including broader term, narrower term, sibling, etc. (like Roget)
– Semantic Lexicons: semantic resources that group together words (or more complex lexical items) according to lexical semantic relations like synonymy, hyponymy, meronymy, and antonymy (like WordNet)
– Semantic Networks: semantic resources that group together objects denoted by natural language expressions (terms) according to a set of relations that originate in the nature of the domain of application (like UMLS in the medical domain)

The MeSH Thesaurus
MeSH (Medical Subject Headings) is a thesaurus for indexing articles and books in the medical domain, which may then be used for searching MeSH-indexed databases. MeSH provides for each term a number of term variants that refer to the same concept. It currently includes a vocabulary of over 250,000 terms. The following is a sample entry for the term gene library (MH is the term itself, ENTRY are term variants):
MH = Gene Library
ENTRY = Bank, Gene
ENTRY = Banks, Gene
ENTRY = DNA Libraries
ENTRY = Gene Bank
etc.
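
For semantic tagging, such an entry is typically turned into a look-up table that maps every term variant to its main heading. The sketch below parses only the MH/ENTRY lines shown on the slide; the function name and the lower-casing of the keys are assumptions, and the real MeSH distribution format contains many more fields.

```python
RECORD = """\
MH = Gene Library
ENTRY = Bank, Gene
ENTRY = Banks, Gene
ENTRY = DNA Libraries
ENTRY = Gene Bank
"""


def build_lookup(record):
    """Map each term variant (lower-cased) to the main heading it belongs to."""
    lookup = {}
    heading = None
    for line in record.splitlines():
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip()
        if key == "MH":
            heading = value
            lookup[value.lower()] = heading
        elif key == "ENTRY" and heading is not None:
            lookup[value.lower()] = heading
    return lookup


variants = build_lookup(RECORD)
```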

The WordNet Semantic Lexicon
WordNet has primarily been designed as a computational account of the human capacity for linguistic categorization and covers an extensive set of semantic classes (called synsets). Synsets are collections of synonyms, grouping together lexical items according to meaning similarity. Strictly speaking, synsets are not made up of lexical items, but rather of lexical meanings (i.e. senses).

WordNet: An Example
The word 'tree' has two meanings that roughly correspond to the classes of plants and of diagrams, each with its own hierarchy of classes that are included in more general super-classes:
tree
  woody_plant, ligneous_plant
    vascular_plant, tracheophyte
      plant, flora, plant_life
        life_form, organism, being, living_thing
          entity, something
tree, tree_diagram
  plane_figure, two-dimensional_figure
    figure
      shape, form
        attribute
          abstraction
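
The two hierarchies can be used to test whether a given sense of tree falls under a semantic category. In this sketch the chains are copied by hand from the slide, using one name per synset; the sense labels "tree (plant)" and "tree (diagram)" and the function name are assumptions, and no WordNet library is involved.

```python
# Super-class chains from the slide, most specific first (one name per synset).
HYPERNYM_CHAINS = {
    "tree (plant)": ["woody_plant", "vascular_plant", "plant", "life_form", "entity"],
    "tree (diagram)": ["plane_figure", "figure", "shape", "attribute", "abstraction"],
}


def is_a(sense, category):
    """True if the category appears among the super-classes of the sense."""
    return category in HYPERNYM_CHAINS[sense]


# Which senses of 'tree' are compatible with a context that requires a shape?
shape_senses = [sense for sense in HYPERNYM_CHAINS if is_a(sense, "shape")]
```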

What is the Semantic Web?
The Semantic Web is a new initiative to transform the web into a structure that supports more intelligent querying and browsing, both by machines and by humans. This transformation is to be supported through the generation and use of metadata constructed via web annotation tools using user-defined ontologies that can be related to one another.
Somewhere on the web

Semantic Web
[Figure: Semantic Web architecture. Components: Web-Page Annotation Tool, Ontology Construction Tool, End User, Community Portal, Inference Engine, Metadata Repository, Annotated Web Pages, Ontology Articulation Toolkit, Ontologies, Agents.]
Based on …

Extracting Events from Structured Documents
Detecting metadata in our example:
– Type of game: N/A
– Teams involved: England - Deutschland
– Players: Deutschland: Kahn (2) - Matthaeus (3) - Babbel (3,5),
– Final (and intermediate) score: 1:0 (0:0)
– Referee: Schiedsrichter: Collina, Pierluigi (Viareggio)
– Date: N/A
– Etc.

Extracting Events from Structured Documents (2)
Detecting events in our example:
– Substitution: Eingewechselt: 61. Gerrard fuer Owen,
– Goal: Tore: 1:0 Shearer (53., Kopfball, Vorarbeit Beckham)
– Cards: Gelbe Karten: Beckham - Babbel, Jeremies
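
Because structured match reports follow fixed line formats, the events listed on this slide can be picked out with a few regular expressions. The patterns, dictionary keys and function name below are assumptions written against the three example lines only; they are a sketch, not the patterns used in the talk, and cards are left out.

```python
import re

# "61. Gerrard fuer Owen": minute, player coming on, player going off.
SUBSTITUTION_RE = re.compile(r"(\d+)\.\s+(\w+)\s+fuer\s+(\w+)")
# "1:0 Shearer (53., Kopfball, Vorarbeit Beckham)": score, scorer, minute, details.
GOAL_RE = re.compile(r"(\d+:\d+)\s+(\w+)\s+\((\d+)\.(?:,\s*([^)]*))?\)")


def extract_events(line):
    """Return a list of event dicts for one labelled line of a match report."""
    label, _, body = line.partition(":")
    label = label.strip()
    if label == "Eingewechselt":  # substitutions
        return [
            {"type": "substitution", "minute": int(minute),
             "player_in": player_in, "player_out": player_out}
            for minute, player_in, player_out in SUBSTITUTION_RE.findall(body)
        ]
    if label == "Tore":  # goals
        events = []
        for match in GOAL_RE.finditer(body):
            details = [d.strip() for d in (match.group(4) or "").split(",") if d.strip()]
            events.append({"type": "goal", "score": match.group(1),
                           "scorer": match.group(2), "minute": int(match.group(3)),
                           "details": details})
        return events
    return []


substitutions = extract_events("Eingewechselt: 61. Gerrard fuer Owen,")
goals = extract_events("Tore: 1:0 Shearer (53., Kopfball, Vorarbeit Beckham)")
```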

Results in XML
Events (and entities and relations) are extracted automatically from structured text, on the basis of patterns (DTD) of typical expressions and the soccer ontology. Example and Example_2.
Since the various results are available in XML files, they can be merged automatically, guided by the ontology. Example. This supports an incremental and dynamic extraction.
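
Merging XML results from two sources can be sketched with the standard library. The element and attribute names (events, event, type, minute, scorer, assist, player_in) are hypothetical because the DTD is not reproduced in this transcript, and the merge key (event type plus minute) is a simple stand-in for the ontology-guided merging described on the slide.

```python
import xml.etree.ElementTree as ET

DOC_A = '<events><event type="goal" minute="53" scorer="Shearer"/></events>'
DOC_B = (
    '<events>'
    '<event type="goal" minute="53" assist="Beckham"/>'
    '<event type="substitution" minute="61" player_in="Gerrard"/>'
    '</events>'
)


def merge_events(*documents):
    """Merge event lists; events with the same type and minute are unified
    and their attributes combined (the first value seen wins on conflict)."""
    merged = ET.Element("events")
    seen = {}
    for document in documents:
        for event in ET.fromstring(document).findall("event"):
            key = (event.get("type"), event.get("minute"))
            if key in seen:
                for name, value in event.attrib.items():
                    seen[key].attrib.setdefault(name, value)
            else:
                seen[key] = ET.SubElement(merged, "event", dict(event.attrib))
    return merged


merged = merge_events(DOC_A, DOC_B)
```

Each new document only adds attributes or events to what is already known, which is one way to read the incremental extraction mentioned on the slide.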

Extracting Events from Semi-Structured Documents
Linguistic processing is needed to provide a basic structure of the document, which allows domain-specific annotation. Example.

Extracting Events from Semi-Structured Documents (2)
The results from the semantic annotation of the structured documents are used as well, supporting incremental extraction. Example.

Current Development
– Extracting information from multilingual balance sheets (WINS eTen project), extending this to unstructured text, and extracting relations and events from annexes to balance sheets (upcoming project MUSING).
– Detecting positive/negative mentions of entities in news documents (project Direct-Info on Media Monitoring). Example.

Further Challenge for HLT
Use HLT not only for the semantic annotation of web pages (or other documents), but also for supporting ontology extraction/learning from the web (or other documents).

Example of semantic relation extraction in bio-medicine
[Rheumatoid arthritis] [is characterized] [by progressive synovial inflammation and joint destruction] [.]
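
Once the sentence is chunked as shown, a relation can be read off with a pattern over neighbouring chunks. The relation name characterized_by, the pattern and the naive split on "and" are assumptions for illustration; in particular, the sketch does not decide whether progressive also applies to joint destruction.

```python
CHUNKS = [
    "Rheumatoid arthritis",
    "is characterized",
    "by progressive synovial inflammation and joint destruction",
    ".",
]


def extract_relation(chunks):
    """Find 'X is characterized by Y (and Z)' over three adjacent chunks."""
    for i in range(len(chunks) - 2):
        subject, verb_group, complement = chunks[i:i + 3]
        if verb_group == "is characterized" and complement.startswith("by "):
            objects = [part.strip() for part in complement[3:].split(" and ")]
            return ("characterized_by", subject, objects)
    return None


relation = extract_relation(CHUNKS)
```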

Open issues for HLT and SW
– Achieving better coordination in order to improve semantic annotation results
– Development and use of standards for interrelated linguistic and semantic annotation (see the eContent project LIRICS for standards for language resources)

Interoperable Standards?

Thank you!