1 Language Technologies (2) Valentin Tablan, University of Sheffield, UK. ACAI 05 Advanced Course on Knowledge Discovery


2 Overview
- Examples of HLT for the Semantic Web in use
- Work in the context of the EU SEKT and PrestoSpace projects
- Mixed Initiative Information Extraction
- RichNews (automated annotation of news programs)

3 Mixed Initiative IE
- Uses Machine Learning for Information Extraction
- The human annotator (HA) and the system can both take the initiative
- The HA provides some bootstrap examples
- The MI engine learns and starts suggesting annotations
- The HA corrects these annotations
- And so on…
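The bootstrap-suggest-correct loop above can be sketched in miniature. All names here (`MIEngine`, `train`, `suggest`) are illustrative, not the actual SEKT/GATE API, and the "model" is a trivial lookup table standing in for a real learner:

```python
class MIEngine:
    """Toy engine that learns token -> label mappings from examples."""

    def __init__(self):
        self.examples = {}  # token -> label

    def train(self, annotations):
        # HA-provided (or HA-corrected) examples update the model.
        self.examples.update(annotations)

    def suggest(self, tokens):
        # Suggest annotations only for tokens seen during training.
        return {t: self.examples[t] for t in tokens if t in self.examples}


# Bootstrap: the human annotator labels a few examples...
engine = MIEngine()
engine.train({"Sheffield": "Location", "Tablan": "Person"})

# ...the engine starts suggesting annotations on new text...
suggestions = engine.suggest(["Valentin", "Tablan", "visited", "Sheffield"])

# ...and the HA's corrections feed back into the model.
engine.train({"Valentin": "Person"})
```

A real MI engine would retrain a statistical model in the background rather than memorise tokens, but the control flow is the same.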

4 What is Mixed Initiative IE?
- Also known as adaptive IE
- Not active learning! In active learning the system selects the next document to be annotated by the user, which improves performance
- Active learning is not part of the MI API, but the API can make use of it

5 Requirements
An MI engine must:
- Work as a background task
- Suggest annotations only when a given performance level is reached
- Be easily usable by a non-expert user
- Offer fine-grained parameters for experts

6 OBIE (Ontology-Based Information Extraction) Example
- Find instances in a document of entities and relations from an ontology
- Usable by a non-expert end user
- No learning corpus available
- Quick adaptation to a new ontology

7 Specifics of the MI API
- Train a statistical model
- Use several ML algorithms: SVM, decision trees, neural nets, etc.
- Compare the ML models and use the one that performs best at time t
- Combine the ML models
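The "use the best model at time t" idea amounts to periodic model selection on held-out data. A toy sketch, with two stand-in learners replacing the real SVM / decision-tree / neural-net engines:

```python
def majority_class(train):
    # Baseline learner: always predict the most frequent training label.
    labels = [y for _, y in train]
    top = max(set(labels), key=labels.count)
    return lambda x: top

def first_feature_rule(train):
    # Predict the label most often seen with the first feature value.
    table = {}
    for x, y in train:
        table.setdefault(x[0], []).append(y)
    rule = {k: max(set(v), key=v.count) for k, v in table.items()}
    fallback = majority_class(train)
    return lambda x: rule.get(x[0], fallback(x))

def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

# Examples are (feature-vector, class) pairs, as on the previous slides.
train = [((1, 0), "A"), ((1, 1), "A"), ((1, 0), "A"),
         ((0, 0), "B"), ((0, 1), "B")]
held_out = [((1, 1), "A"), ((0, 0), "B"), ((0, 1), "B")]

models = {"majority": majority_class(train),
          "rule": first_feature_rule(train)}
# Model selection: keep whichever engine scores best right now.
best = max(models, key=lambda name: accuracy(models[name], held_out))
```

Repeating this selection as the corpus grows gives the "best engine at time t" behaviour; combining the models (e.g. by voting) is the alternative the slide mentions.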

8 Expected Behaviour [Chart: performance plotted against time / size of the learning corpus for Engines 1, 2 and 3 and the MI Engine, with the minimal tolerated performance level marked as a threshold line]

9 Limitations of the ML API
- One configuration file per engine → not suitable for a non-expert
- Class definition is set in the file → a problem for OBIE: an ontology is not a fixed set of named-entity classes → settings must be dynamic
- Engine characteristics (binary, numeric, nominal) → need uniform declaration and automatic conversion
- Engines operate on tokens → cannot annotate spans
- One class per engine → how to handle an entity attribute with several possible values using a binary engine?

10 Meta Engine
- Combines several instances of 'simple' engines (all of the same type, e.g. MaxEnt)
- Accepts rich descriptions of classes and attributes
- Converts them into a format suitable for the 'simple' engines
- Merges the results of the embedded engines
- Behaves like a simple engine
- Hides the dirty work
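One way a meta engine can give a multi-valued attribute to engines that only make binary decisions is one-vs-rest combination: one binary engine per class value, results merged by taking the highest-scoring class. A minimal sketch with illustrative names (the binary engine here is a trivial memoriser, not a real classifier):

```python
class BinaryEngine:
    """Toy binary engine: decides membership of a single target class."""

    def __init__(self, target):
        self.target = target
        self.positives = set()

    def train(self, items, labels):
        self.positives = {i for i, l in zip(items, labels)
                          if l == self.target}

    def score(self, item):
        return 1.0 if item in self.positives else 0.0


class MetaEngine:
    """Wraps one binary engine per class value, one-vs-rest style."""

    def __init__(self, classes):
        self.engines = {c: BinaryEngine(c) for c in classes}

    def train(self, items, labels):
        for engine in self.engines.values():
            engine.train(items, labels)

    def classify(self, item):
        # Merge the embedded engines' results: highest score wins.
        return max(self.engines, key=lambda c: self.engines[c].score(item))


meta = MetaEngine(["Person", "Location", "Organization"])
meta.train(["Tablan", "Sheffield", "BT"],
           ["Person", "Location", "Organization"])
```

From the outside, `MetaEngine` looks like a simple single-engine classifier, which is the point: the client code never sees the binary decomposition.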

11 MI API Architecture [Diagram: GUI / client code sits on top of the Mixed Initiative API; a Mixed Initiative Engine contains an Orchestrator that coordinates Meta Engines, an Evaluation Module and a DataSet]

12 Data Set
- Information stored as examples, not documents
- Used by Meta Engines
- Possibly converted to a native data-set format (e.g. SVMlight)
- Possibly reuses an existing implementation (WEKA, YALE, …)
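Conversion to a native format such as SVMlight is mostly string formatting: one example per line, `<label> <index>:<value> ...`, with feature indices in ascending order. A minimal writer:

```python
def to_svmlight(examples):
    """Serialise (label, {feature_index: value}) pairs to SVMlight lines."""
    lines = []
    for label, features in examples:
        # SVMlight requires feature indices in ascending order.
        feats = " ".join(f"{i}:{v}" for i, v in sorted(features.items()))
        lines.append(f"{label} {feats}")
    return "\n".join(lines)


data = [(1, {1: 0.5, 3: 1.0}),   # positive example, sparse features
        (-1, {2: 0.25})]          # negative example
print(to_svmlight(data))
```

Going through such a flat format (or delegating to WEKA/YALE data structures) keeps the Data Set component independent of any one learner.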

13 MI API Architecture [Diagram repeated, highlighting the Evaluation Module within the Mixed Initiative Engine]

14 Evaluation Module
- Operates on the Data Set
- Choice of corpus-splitting strategies (including K-fold cross-validation)
- Different evaluation metrics
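K-fold cross-validation, one of the corpus-splitting strategies mentioned, partitions the examples into K folds and lets each fold serve once as the held-out set:

```python
def k_fold(items, k):
    """Yield (train, held_out) splits; each item is held out exactly once."""
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        held_out = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, held_out


splits = list(k_fold(list(range(6)), 3))
```

Averaging a metric over the K held-out folds gives a more stable estimate than a single split, which matters when the MI engine must decide whether it has reached the minimal tolerated performance level.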

15 MI API Architecture [Diagram repeated, highlighting the Orchestrator within the Mixed Initiative Engine]

16 Orchestrator
- The core of an MI Engine
- Manages the Meta Engines
- Uses the Data Set and the Evaluation Module
- Returns information about the Meta Engines: confusion matrix, performance, etc.
- Combines the ML models
- Converts from / to annotations

17 Allegory: MI Engine = Orchestra
Meta Engine = music-school teacher:
1. Learn one / some / all instruments (entities)?
2. Exams for all at the same time?
3. Good enough, and better than the existing orchestra?
MI Engine = orchestra conductor:
1. Combines their skills
2. Plays for an audience

18 Summary of MI IE
- A required component for Ontology-Based Information Extraction
- State-of-the-art functionality
- Reaches a high performance level by combining classification algorithms

19 RichNews
- RichNews aims to automate the annotation of news programs
- Starts from recordings of broadcasts
- Produces annotations that can be included in a semantic repository (e.g. KIM/Sesame)
- Works for English, but most processing resources can be adapted to other languages

20 Key Problems
- Speech recognition produces poor-quality transcripts with many mistakes
- A news broadcast contains several stories: how do we work out where one starts and another stops?
- How can we make a summary or headline for each story from a poor-quality transcript?
- How can we work out what kind of news each story reports?

21 Augmented Television News
- News broadcasts are often augmented with textual content
  – Usually only limited content is available
  – The TV company controls content production
- Rich News finds content automatically
  – Developed on BBC news
  – Recordings of broadcasts go in one end
  – Relevant news web pages come out the other, associated with the stories in the broadcast fully automatically

22 Semantic Indexing of News
- Systems already exist that can index news broadcasts in terms of the 'named entities' they refer to
  – e.g. Mark Maybury's Broadcast News Navigator
  – Entities such as cities, people and organizations are marked as such
- Rich News improves on this annotation:
  – Annotation is in terms of an ontology
  – Uses Automatic Speech Recognition, so it can be applied when no subtitles are available
  – Web pages are used to help find named entities

23 Using ASR Transcripts
- ASR is performed by the THISL system
- Based on the ABBOT connectionist speech recognizer
- Optimized specifically for use on BBC news broadcasts
- Average word error rate of 29%
- Error rate of up to 90% for out-of-studio recordings
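Word error rate, the metric quoted above, is the word-level edit distance (substitutions, insertions, deletions) between the recogniser's output and a reference transcript, divided by the reference length. A standard dynamic-programming implementation:

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edits needed to turn the first i reference words
    # into the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # deleting i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j          # inserting j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub,            # match or substitution
                          d[i - 1][j] + 1,  # deletion
                          d[i][j - 1] + 1)  # insertion
    return d[len(ref)][len(hyp)] / len(ref)


# One substitution plus one deletion over six reference words.
error = wer("the cat sat on the mat", "the cat sit on mat")
```

A WER of 29% thus means roughly three word errors per ten reference words, which is why the downstream components must be robust to transcript noise.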

24 Semantic Annotation: General Architecture [Diagram: media objects from an information source pass through source detection, source extraction, story segmentation and IE; a multi-source IE merger feeds the semantic index]

25 RichNews Architecture [Diagram: media objects are transcribed by the THISL ASR; the story segmenter splits the ASR transcript into stories 1…N; the web miner finds related web pages; the KIM ontological IE system links entities to ontology instances in the semantic index; manual annotation with GATE/ELAN is optional]

26 Topical Segmentation
Uses the C99 segmenter:
- Removes common words from the ASR transcripts
- Stems the remaining words to get their roots
- Then looks at which parts of the transcript the same words tend to occur in
- Such parts will probably report the same story
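The lexical-cohesion intuition can be sketched as follows. Note that C99 itself uses a rank-transformed similarity matrix and divisive clustering rather than this naive adjacent-block comparison, and the stop list and stemmer below are toy stand-ins:

```python
STOPWORDS = {"the", "a", "of", "in", "on", "is", "was", "to"}

def stem(word):
    # Crude stemmer: strip a couple of common suffixes.
    for suffix in ("ing", "s"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

def preprocess(sentence):
    # Remove common words, stem the rest.
    words = [w.lower() for w in sentence.split()
             if w.lower() not in STOPWORDS]
    return {stem(w) for w in words}

def overlap(a, b):
    # Jaccard word overlap between two blocks.
    return len(a & b) / max(1, len(a | b))

sentences = [
    "The minister announced the budget",
    "The budget increases spending",
    "Rain is forecast in Sheffield",
    "Sheffield weather stays wet",
]
blocks = [preprocess(s) for s in sentences]
scores = [overlap(blocks[i], blocks[i + 1]) for i in range(len(blocks) - 1)]
# A low-cohesion point between blocks suggests a story boundary.
boundary = scores.index(min(scores)) + 1  # boundary before this sentence
```

Parts of the transcript that keep reusing the same stems cohere into one story; the drop in overlap between sentences 2 and 3 is where the segmenter would cut.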

27 Key Phrase Extraction
Uses term frequency × inverse document frequency (TF.IDF):
- Chooses sequences of words that occur more frequently in the story than in the language as a whole
- Any sequence of up to three words can be a phrase
- Up to four phrases are extracted per story
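A minimal TF.IDF scorer along these lines, using unigrams for brevity (the real system scores phrases of up to three words, against a large background corpus rather than the toy one here):

```python
import math

def tfidf(phrase, story, corpus):
    """Score a phrase: frequent in the story, rare in the corpus."""
    tf = story.count(phrase)                       # term frequency
    df = sum(phrase in doc for doc in corpus)      # document frequency
    idf = math.log(len(corpus) / (1 + df))         # smoothed IDF
    return tf * idf

story = ["budget", "minister", "budget", "tax"]
corpus = [["weather", "rain"], ["football", "match"], story]

scores = {w: tfidf(w, story, corpus) for w in set(story)}
top = max(scores, key=scores.get)
```

Taking the four highest-scoring phrases per story gives the query terms used in the web-search step that follows.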

28 Web Search
- The key phrases are used to search the BBC, Times, Guardian and Telegraph websites for web pages reporting each story in the broadcast
- Searches are restricted to the day of broadcast or the day after
- Searches are repeated using different combinations of the extracted key phrases
- The text of the returned web pages is compared to the transcript to find matching stories
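The final matching step can be sketched with a simple word-overlap comparison between transcript and page text; the system's actual similarity measure and acceptance threshold are not given in the slides, so both are assumptions here:

```python
def similarity(a, b):
    """Jaccard overlap between the word sets of two texts."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(1, len(wa | wb))

def best_match(transcript, pages, threshold=0.2):
    """Return the URL of the best-matching page, or None if all are weak."""
    scored = [(similarity(transcript, text), url) for url, text in pages]
    score, url = max(scored)
    return url if score >= threshold else None

# Hypothetical transcript snippet and candidate pages returned by search.
transcript = "floods hit northern england after heavy rain"
pages = [
    ("bbc.example/floods", "heavy rain brings floods to northern england"),
    ("bbc.example/cricket", "england win the cricket match"),
]
match = best_match(transcript, pages)
```

Rejecting pages below the threshold is what keeps loosely related same-day stories from being attached to the wrong broadcast segment.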

29 Evaluation
- Success in finding matching web pages was investigated
- Evaluation based on 66 news stories from 9 half-hour news broadcasts
- Web pages were found for 40% of the stories
- 7% of the pages reported a closely related story instead of the one in the broadcast
- Results are based on an earlier version of the system, which used only BBC web pages

30 Using the Web Pages
- Web pages can be made available to the viewer as additional content
- The web pages contain:
  – A headline, summary and section for each story
  – High-quality text that is readable and contains correctly spelt proper names
  – More in-depth coverage of the stories
- Web pages could be included in the broadcast by the TV company, or discovered by a device in viewers' homes

31 Semantic Annotation
- KIM can semantically annotate the text derived from the web pages:
  – KIM identifies people, organizations, locations, etc.
  – KIM performs well on the web-page text, but very poorly when run on the transcripts directly
- This allows semantic, ontology-aided searches for stories about particular people, locations, etc.
  – For example, we could search for people called Sydney, which would be difficult with a text-based search

32 Search for Entities

33 Story Retrieval

34 Summary of RichNews
Rich News can automatically segment, describe and classify news broadcasts:
- Requires an on-line textual source that closely parallels the broadcasts
- High precision, moderate recall (so far)
- Easy to adapt to other languages