A Self Learning Universal Concept Spotter By Tomek Strzalkowski and Jin Wang Presented by Iman Sen

Introduction Previously, information taggers were hand-crafted, domain-specific, and/or too reliant on lexical clues such as capitalization and formatting. The Universal Spotter is one of the first unsupervised learning algorithms that can identify entities of any category in any large corpus, given some initial examples and contextual information about what to spot.

Basic Idea Start with some example items and contexts for the things to spot (the seed) and a large corpus. Exploit the redundancy of patterns in text: use the seed examples to find “new” items and contextual information to add to the original set of rules. Initially precision is high but recall is very low. Repeat the cycle to maximize recall while maintaining or improving precision.

Seeds: What we are looking for Initially, the seed is some information provided by the user: examples, contextual information, or both. Examples can be highlighted in the text (“Microsoft”, “toothbrushes”). Contextual information, both internal and external, can also be specified, for example “name ends with Co.” or “appears after produced”. Negative examples and negative context can be given as well, such as “not to the right of produced”.
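
As an illustration only, a seed like this might be represented in memory as follows; the field names and structure are illustrative, not taken from the paper.

seed = {
    "examples": ["Microsoft", "toothbrushes"],                # positive example entities
    "internal_context": ["name ends with Co."],               # clues inside the entity itself
    "external_context": ["appears after 'produced'"],         # clues in the surrounding text
    "negative_examples": [],                                  # items known not to belong
    "negative_context": ["not to the right of 'produced'"],   # contexts that argue against a match
}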

The Cyclic Process
1. Build rules from the initial examples and context information.
2. Find further examples of the concept in the corpus, trying to maximize precision and recall.
3. As more examples of the concept are found, more contextual information can be extracted.
4. Use the expanded context information to find more entities.

Simple Example Suppose we have the seeds “Co” and “Inc” and the following text: “Henry Kaufman is president of Henry Kaufman & Co., ... president of Gabelli Funds Inc.; Claude N. Rosenberg is named president of Thomson S.A. ...” Use “Co” and “Inc” to pick out “Henry Kaufman & Co.” and “Gabelli Funds Inc.”. Use these new examples to extract contextual information, such as “president of” appearing before each entity. Then use “president of” to find “Thomson S.A.”
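
To make the cycle concrete, here is a toy, regex-based sketch of the three steps applied to the slide's text. The patterns are simplified illustrations, not the paper's actual matching machinery.

import re

text = ("Henry Kaufman is president of Henry Kaufman & Co., ... "
        "president of Gabelli Funds Inc. ; Claude N. Rosenberg is named "
        "president of Thomson S.A. ...")

# Step 1: use the seed suffixes "Co"/"Inc" to pick out the first candidate names.
seed_pattern = re.compile(r"((?:[A-Z&][\w.&]*\s)+(?:Co|Inc)\b\.?)")
candidates = seed_pattern.findall(text)   # ['Henry Kaufman & Co.', 'Gabelli Funds Inc.']

# Step 2: harvest the two words immediately preceding each candidate as new context.
contexts = set()
for cand in candidates:
    m = re.search(r"(\w+ \w+)\s+" + re.escape(cand), text)
    if m:
        contexts.add(m.group(1))          # {'president of'}

# Step 3: apply the learned context to find entities that the suffix seeds missed.
new_entities = []
for ctx in contexts:
    for hit in re.findall(re.escape(ctx) + r" ((?:[A-Z&][\w.&]*\s?)+)", text):
        new_entities.append(hit.strip())  # now includes 'Thomson S.A.'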

The Classification Task The goal is to decide whether a sequence of words contains a desired entity or concept. This is done by calculating significance weights (SW) for the evidence around the sequence and then combining them.

The Process: In Detail Initially some preprocessing is done, including tokenization, POS tagging, and lexical normalization (stemming). POS tagging helps to delineate which sequences of words might contain the desired entities. These steps reduce the amount of noise.
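
A minimal preprocessing sketch is shown below, using NLTK purely as a stand-in for whatever tokenizer, tagger, and stemmer the original system used.

import nltk                          # needs the 'punkt' and 'averaged_perceptron_tagger'
from nltk.stem import PorterStemmer  # resources, installed via nltk.download(...)

sentence = "The boys kicked the door with rage."
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)        # e.g. [('The', 'DT'), ('boys', 'NNS'), ...]

stemmer = PorterStemmer()
normalized = [(stemmer.stem(word), tag) for word, tag in tagged]
# The POS tags let candidate central units be restricted to noun-phrase-like spans.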

How to calculate SW Consider a candidate sequence of words W1, W2, ..., Wm (the central unit) and a window of size n on either side of it in which contextual information is collected. Then:
- Form pairs (word, position) for all words within the window, where position is one of preceding context (p), central unit (s), or following context (f).
- Similarly form pairs (bigram, position).
- Form 3-tuples (word, position, distance) for the same words, where distance is measured from W1 (preceding context) or Wm (following context); words inside W1 ... Wm take their distance from Wm.

An SW Calculation Example Example: ... boys kicked the door with rage ..., with window n=2 and central unit “the door”. The generated tuples (called evidence items) are: (boys, p), (kicked, p), (the, s), (door, s), (with, f), (rage, f), ((boys, kicked), p), ((the, door), s), ((with, rage), f), (boys, p, 2), (kicked, p, 1), (the, s, 2), (door, s, 1), (with, f, 1), (rage, f, 2), ((boys, kicked), p, 1), ((the, door), s, 1), ((with, rage), f, 1)
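
The following sketch reproduces the evidence items above; the exact distance convention is inferred from this example rather than spelled out on the slides.

def evidence_items(prec, central, foll):
    """Generate (item, position[, distance]) tuples for one candidate.
    prec / foll are the n context words before / after the central unit."""
    items = []
    for words, pos in [(prec, "p"), (central, "s"), (foll, "f")]:
        bigrams = list(zip(words, words[1:]))
        items += [(w, pos) for w in words]        # (word, position)
        items += [(b, pos) for b in bigrams]      # (bigram, position)
        # Following context counts outward from the unit (1, 2, ...);
        # preceding context and the unit itself count down toward it.
        dists = range(1, len(words) + 1) if pos == "f" else range(len(words), 0, -1)
        items += [(w, pos, d) for w, d in zip(words, dists)]
        items += [(b, pos, 1) for b in bigrams]   # bigram distance = nearest word
    return items

# Window n = 2, central unit "the door":
tuples = evidence_items(["boys", "kicked"], ["the", "door"], ["with", "rage"])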

SW Calculation (continued) There are two groups of items: A, the group of accepted items, and R, the group of rejected items. These groups are used to calculate SW:

SW(t) = (f(t,A) - f(t,R)) / (f(t,A) + f(t,R))   if f(t,A) + f(t,R) > s
SW(t) = 0                                       otherwise

where s is a constant used to filter out noise and f(x, X) is the frequency of x in X. SW defined this way takes values between -1.0 and 1.0. For some threshold e > 0, SW(t) > e is taken as positive evidence and SW(t) < -e as negative evidence.
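
A direct transcription of this definition, with an arbitrary illustrative value for s (the paper only describes it as a constant):

from collections import Counter

def significance_weight(item, freq_A, freq_R, s=3):
    """SW(item) from frequency tables over the accepted (A) and rejected (R) groups."""
    fa, fr = freq_A[item], freq_R[item]
    if fa + fr > s:                 # s filters out rare, noisy evidence items
        return (fa - fr) / (fa + fr)
    return 0.0

# e.g. freq_A = Counter({("of", "p"): 8}); freq_R = Counter({("of", "p"): 1})
#      significance_weight(("of", "p"), freq_A, freq_R)  ->  (8-1)/(8+1) ≈ 0.78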

Combining SW weights These SW weights are then combined, and if the combined weight exceeds a threshold the item becomes available during the tagging stage. The primary combining scheme used by the authors is:

x ⊕ y = x + y - xy   if x > 0 and y > 0
x ⊕ y = x + y + xy   if x < 0 and y < 0
x ⊕ y = x + y        otherwise

Note: combined values still remain within [-1.0, 1.0].
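
The combining operator translates directly into code:

from functools import reduce

def combine(x, y):
    """The x ⊕ y scheme above; combined values stay within [-1.0, 1.0]."""
    if x > 0 and y > 0:
        return x + y - x * y
    if x < 0 and y < 0:
        return x + y + x * y
    return x + y

# Merging all evidence weights for one candidate:
total = reduce(combine, [0.6, 0.5, -0.2], 0.0)   # 0.6 -> 0.8 -> 0.6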

Bootstrapping The basic bootstrapping process then looks like this:

procedure Bootstrapping
    collect seeds
    loop
        Training phase (calculate SW weights, combine, add to rules)
        Tagging phase (use all accumulated rules to tag)
    until satisfied
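
A rough Python skeleton of this loop is given below. It is a sketch only: the corpus-specific pieces (candidate extraction and evidence counting) are passed in as hypothetical helper functions, and the cycle count and threshold are illustrative.

def bootstrap(seeds, extract_candidates, count_evidence, sig_weight, combine,
              max_cycles=4, threshold=0.5):
    # extract_candidates() is assumed to yield (candidate, evidence_items) pairs,
    # and count_evidence(accepted, rejected) to return frequency tables for the
    # accepted/rejected groups; both are hypothetical, not from the paper.
    rules = {}                            # accumulated evidence-item weights
    accepted, rejected = set(seeds), set()
    for _ in range(max_cycles):
        # Training phase: recompute SW for every evidence item seen so far.
        freq_A, freq_R = count_evidence(accepted, rejected)
        for item in set(freq_A) | set(freq_R):
            rules[item] = sig_weight(item, freq_A, freq_R)
        # Tagging phase: score each candidate by combining its evidence weights.
        for cand, items in extract_candidates():
            score = 0.0
            for item in items:
                score = combine(score, rules.get(item, 0.0))
            if score > threshold:
                accepted.add(cand)
                rejected.discard(cand)
            elif score < -threshold:
                rejected.add(cand)
                accepted.discard(cand)
    return accepted, rules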

Experiments and Results Organizations: training on a 7 MB WSJ corpus, testing on 10 selected articles. Initially precision was 97% but recall only 49%; after the 4th cycle the system reached precision = 95% and recall = 90%. A similar experiment for identifying products gave worse results.

Improvements Possible improvements include different weighting and combining schemes, and universal lexicon lookups: accepted items could be verified against existing online lexical databases. The program currently cannot deal with conjunctions of noun phrases because of identification difficulties.

Some Considerations It is not clear how many initial seeds were provided. The program is described as identifying one category of items at a time, but it could be extended to more. A limitation is that some contexts/examples may be impossible to spot because of noise in the data, as may entities that lack obvious context patterns. Errors made by the POS tagger are inherited.