Semantic Indexing with Typed Terms using Rapid Annotation

Presentation transcript:

Semantic Indexing with Typed Terms using Rapid Annotation Chris Biemann University of Leipzig 16th of August 2005 TKE-05 Workshop on Semantic Indexing, Copenhagen

Outline
- The benefits of typed terms and relations
- Alleviating the ontology bottleneck
- Rapid annotation
- Sources for annotation candidates
- Annotation tools
- Case study: Annotation of "Deutscher Wortschatz"
- Conclusion

Typed terms and relations
The bag-of-words model treats all terms equally:
- Document similarity is based on all terms
- No views on the data are possible
Typed terms and relations:
- Multiple views on documents w.r.t. types
- Document similarity restricted to types and augmented by relations
- Enables some tasks of Question Answering

Motivating example: untyped
Documents:
- Doc 1: The government official A. Smith signed a contract for the purchase of 100 tanks from weapon manufacturer B. Miller.
- Doc 2: "Weapon sales increased", a government official stated, "especially tanks sell well."
- Doc 3: A holiday cruise on a yacht invites you to take photos of seagulls.
- Doc 4: The photos show A. Smith on a cruise with B. Miller's yacht.
[Figure: similarity of terms and clustering for Doc 1 to Doc 4]

Motivating example: type PERSON
Documents:
- Doc 1: The government official A. Smith signed a contract for the purchase of 100 tanks from weapon manufacturer B. Miller.
- Doc 2: "Weapon sales increased", a government official stated, "especially tanks sell well."
- Doc 3: A holiday cruise on a yacht invites you to take photos of seagulls.
- Doc 4: The photos show A. Smith on a cruise with B. Miller's yacht.
[Figure: similarity of terms and clustering for Doc 1 to Doc 4, restricted to type PERSON]
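The PERSON view on the four example documents can be sketched as a set intersection over typed terms. This is a minimal illustration, not the presenter's implementation: the PERSON sets below are assigned by hand as an assumption, and the function name `typed_similarity` is hypothetical.

```python
from itertools import combinations

# Hand-assigned PERSON terms per document (an assumption for illustration;
# in practice they would come from the annotation).
persons = {
    "Doc 1": {"A. Smith", "B. Miller"},
    "Doc 2": set(),
    "Doc 3": set(),
    "Doc 4": {"A. Smith", "B. Miller"},
}


def typed_similarity(terms_by_doc):
    """Return the number of shared typed terms for every document pair."""
    return {
        (a, b): len(terms_by_doc[a] & terms_by_doc[b])
        for a, b in combinations(sorted(terms_by_doc), 2)
    }


similarity = typed_similarity(persons)
for pair, shared in similarity.items():
    print(pair, shared)
```

Under this assumption only Doc 1 and Doc 4 share PERSON terms, so a clustering on this view would group them together even though their topics (weapon sales, holiday cruise) differ.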

The ontology bottleneck
Semantic Web people believe that annotation with ontology relations will enable semantic search, ...
Annotation: choose an ontology, label all instances in the document
Problems:
- New documents have to be annotated all over again
- Merging of ontologies
- Despite tools, users are reluctant to annotate their documents
[Figure labels: interface; merged ontology; Anno 1 ... Anno n; Doc 1 ... Doc n]

Centralized annotation
- Types and relations for terms are assigned globally and once and for all
- No (logically grounded, consistent) ontology, but a free collection of types and relations suited to the problem
- Annotation is done for document collections
[Figure labels: interface; annotation; document collection of Doc 1 ... Doc n]

Generating candidates for annotation
Given N terms from the collection, it is not feasible to present all N² pairs to an annotator, and most of the pairs will not be related.
Needed: a method that produces terms with similar types and related pairs at a high rate.
Method here:
- Co-occurrence statistics: pairs of terms that occur together significantly often in sentences/documents
- Co-occurrences of higher orders: pairs of terms that have similar co-occurrence statistics
Co-occurrences reflect syntagmatic and paradigmatic relations; the former are ruled out in higher orders.
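A minimal sketch of candidate generation from sentence-level co-occurrences follows. It is an illustration under assumptions: a raw count threshold (`min_count`) stands in for the significance test, tokenization is a plain whitespace split, and the toy sentences and function names are invented for the example.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(sentences):
    """Count, for each unordered term pair, the sentences containing both."""
    counts = Counter()
    for sentence in sentences:
        # Each term counts once per sentence; sorting makes pairs canonical.
        terms = sorted(set(sentence.lower().split()))
        counts.update(combinations(terms, 2))
    return counts


def candidate_pairs(sentences, min_count=2):
    """Keep only pairs that co-occur in at least min_count sentences."""
    counts = cooccurrence_counts(sentences)
    return {pair: c for pair, c in counts.items() if c >= min_count}


# Toy corpus, invented for illustration.
toy_sentences = [
    "the cat chased the mouse",
    "the dog chased the cat",
    "a cat and a dog",
]
candidates = candidate_pairs(toy_sentences)
for pair, count in sorted(candidates.items()):
    print(pair, count)
```

Only pairs that actually share sentences are ever counted, so the annotator sees far fewer than N² pairs. A real setup would replace the threshold with a significance measure, which also keeps frequent function words such as "the" from dominating the candidates.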

The cats and dogs example
- cat co-occurrences: dog, her, food, pet, litter, she, burglar, animal, my, mouse, feline, Garfield, like, Cat, bag
- cat order 2: cats, pet, dog, animals, animal, dogs, pets, neutered, her, she, Synindex, like, tabbie, pigs, shelter
- cat order 4: pet, pets, cats, dog, pigs, animals, dogs, animal, owners, zoo, wild, birds, rabbits, puppies, tiger

Graphical annotation tool: colourizing co-occurrences

Specifying types and relations
Clicking on a node or edge opens a context menu restricted to the part of speech (POS)

Web-based annotation tool for arbitrary candidate sources

Rule-based candidate generation
If some annotation is already present, rules can be specified to obtain candidates at an even higher rate. It is also possible to guess the type of candidates.
Example:
- Rule 1: If IS-A(A,B) and PROPERTY(B), then PROPERTY(A); yields LIVING(dog) as candidate
- Rule 2: If IS-A(A,B) and COHYPONYM(A,C), then IS-A(C,B); yields IS-A(cat, animal) as candidate
[Figure labels: animal (LIVING); IS-A; dog; cat; LIVING; CO-HYPONYM]
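The two rules can be sketched as set comprehensions over relation tuples. This is an illustration, not the tool's implementation: the starting annotations IS-A(dog, animal), LIVING(animal), and COHYPONYM(dog, cat) are inferred from the candidates the rules yield on the slide, and the function names `rule1` and `rule2` are hypothetical.

```python
# Annotations assumed to be present already (inferred from the example).
is_a = {("dog", "animal")}          # IS-A(A, B) stored as (A, B)
cohyponym = {("dog", "cat")}        # COHYPONYM(A, C) stored as (A, C)
properties = {("LIVING", "animal")}  # PROPERTY(B) stored as (PROPERTY, B)


def rule1(is_a, properties):
    """IS-A(A,B) and PROPERTY(B) => candidate PROPERTY(A)."""
    derived = {(p, a) for (a, b) in is_a for (p, x) in properties if x == b}
    return derived - properties  # only propose what is not annotated yet


def rule2(is_a, cohyponym):
    """IS-A(A,B) and COHYPONYM(A,C) => candidate IS-A(C,B)."""
    derived = {(c, b) for (a, b) in is_a for (a2, c) in cohyponym if a2 == a}
    return derived - is_a


property_candidates = rule1(is_a, properties)
is_a_candidates = rule2(is_a, cohyponym)
print(property_candidates)  # {('LIVING', 'dog')}
print(is_a_candidates)      # {('cat', 'animal')}
```

The results are candidates only: an annotator still accepts or rejects each one. Co-hyponymy is treated as directed here for brevity; a symmetric relation would need both orientations stored.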

Tool to accept or reject rule-based candidates

Case study: Annotating Deutscher Wortschatz
www.wortschatz.uni-leipzig.de
In terms of numbers: in 1,000 hours, annotators could choose between 46 semantic types and 57 relations and produced
- 150,000 type instances and
- 150,000 relation instances
for over 80,000 distinct terms. This corresponds to a text coverage of 90% and a speed of 5 units per minute (300,000 instances in 60,000 minutes).

Different relations from different sources

Example: Query resolution with types and relations
Query: "Find documents mentioning at least two heads of computer companies!"
1. Translate into a formal query: Qset = {B | IS-A(A, computer company), HEAD-OF(B, A)}; b1 ∈ Qset, b2 ∈ Qset, b1 ≠ b2
2. Access the search engine with possible b1, b2
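The two steps can be sketched as follows. Everything concrete here is an assumption for illustration: the company and person names are placeholders, the documents are invented, the function names are hypothetical, and a substring test stands in for the search engine.

```python
# Hypothetical annotations; all names are placeholders.
is_a = {
    ("AlphaSoft", "computer company"),
    ("BetaChips", "computer company"),
    ("GammaFoods", "food company"),
}
head_of = {
    ("Person A", "AlphaSoft"),
    ("Person B", "BetaChips"),
    ("Person C", "GammaFoods"),
}


def query_set(is_a, head_of, category):
    """Qset = {B | IS-A(A, category), HEAD-OF(B, A)}."""
    companies = {a for (a, cat) in is_a if cat == category}
    return {b for (b, a) in head_of if a in companies}


def matching_documents(documents, qset, minimum=2):
    """Return names of documents mentioning at least `minimum` distinct b."""
    hits = []
    for name, text in documents.items():
        mentioned = {b for b in qset if b in text}  # naive substring match
        if len(mentioned) >= minimum:
            hits.append(name)
    return hits


documents = {
    "doc1": "Person A met Person B at a trade fair.",
    "doc2": "Person A gave a talk.",
    "doc3": "Person B had lunch with Person C.",
}
qset = query_set(is_a, head_of, "computer company")
hits = matching_documents(documents, qset)
print(sorted(qset), hits)
```

Requiring two distinct members of Qset in the same document expresses the condition b1 ≠ b2; with a real search engine, step 2 would issue conjunctive queries for pairs (b1, b2) instead of scanning texts.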

What Google found: Find documents mentioning at least two heads of computer companies! #1 hit 14.08.2005 www.google.com

Conclusion
- Typed terms and relations can facilitate the processing of electronic documents for a wide range of applications
- Rapid annotation alleviates the acquisition bottleneck by
  - globally annotating
  - local dependencies
- Intuitive tools for annotation are highly important for producing large amounts of annotation in a short time

QUESTIONS?!? THANK YOU

Bonus material Co-occurrences Co-occurrences of higher orders

Statistical co-occurrences
- Co-occurrence: occurrence of two or more words within a well-defined unit of information (sentence, nearest neighbors)
- Significant co-occurrences reflect relations between words
- Significance measure (log-likelihood), computed from:
  - k, the number of sentences containing a and b together
  - ab, (number of sentences with a) * (number of sentences with b)
  - n, the total number of sentences in the corpus
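One measure that uses exactly the quantities k, ab, and n is the negative log-probability of observing k joint sentences under a Poisson model with expectation ab/n. The sketch below is an assumption: it is not necessarily the formula shown on the slide, and the function name `poisson_significance` is hypothetical.

```python
import math


def poisson_significance(k, count_a, count_b, n):
    """Negative log Poisson probability of exactly k joint sentences.

    k        -- sentences containing a and b together
    count_a  -- sentences containing a
    count_b  -- sentences containing b
    n        -- total number of sentences in the corpus
    """
    if min(count_a, count_b, n) <= 0 or k < 0:
        raise ValueError("counts must be positive and k non-negative")
    lam = count_a * count_b / n  # expected joint count under independence
    # -log(e^-lam * lam^k / k!) = lam - k*log(lam) + log(k!)
    return lam - k * math.log(lam) + math.lgamma(k + 1)


# Two words with 10 sentences each in a 100-sentence corpus: lam = 1.
for k in (1, 2, 5):
    print(k, round(poisson_significance(k, 10, 10, 100), 3))
```

For k well above the expectation ab/n the value grows with k, so pairs that co-occur much more often than chance predicts rank highest. Values for k below the expectation would need separate handling in a real ranking.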

Iterating co-occurrences
- (Sentence-based) co-occurrences of first order: words that co-occur significantly often in sentences
- Co-occurrences of second order: words that co-occur significantly often in collocation sets of first order
- Co-occurrences of n-th order: words that co-occur significantly often in collocation sets of (n-1)th order
When calculating a higher order, the significance values of the preceding order are not relevant. A co-occurrence set consists of the N highest-ranked co-occurrences of a word.
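The iteration can be sketched by treating each co-occurrence set of the previous order like a sentence. This is a simplified illustration under assumptions: raw counts with a `min_count` threshold replace the significance test, and the first-order sets and the function name `next_order` are invented for the example.

```python
from collections import Counter, defaultdict
from itertools import permutations


def next_order(cooc_sets, top_n=3, min_count=2):
    """Derive co-occurrence sets of the next order.

    cooc_sets maps each word to its co-occurrence set of the current order.
    Only set membership is used, not the scores of the preceding order.
    """
    pair_counts = defaultdict(Counter)
    for members in cooc_sets.values():
        # Words appearing in the same set co-occur once for that set.
        for a, b in permutations(sorted(members), 2):
            pair_counts[a][b] += 1
    result = {}
    for word, partners in pair_counts.items():
        ranked = sorted(partners.items(), key=lambda item: (-item[1], item[0]))
        kept = [w for w, count in ranked if count >= min_count][:top_n]
        if kept:
            result[word] = set(kept)  # the N highest-ranked co-occurrences
    return result


# Invented first-order sets, keyed by the word they belong to.
first_order = {
    "barking": {"dog", "terrier"},
    "bite": {"dog", "terrier", "cat"},
    "yelp": {"dog", "terrier"},
    "mouse": {"cat"},
}
second_order = next_order(first_order)
print(second_order)
```

In this toy input "dog" and "terrier" are never listed as direct co-occurrences of each other, yet they become second-order co-occurrences because they share three first-order sets. Feeding `second_order` back into `next_order` would give the third order.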

Constructed Example I
[Table: Ord 1 matrix over the terms dog, terrier, cat, mouse, barking, bite, yelp]

Constructed Example II
[Tables: Ord 2 and Ord 3 matrices over the terms dog, terrier, cat, mouse, barking, bite, yelp]