Semantic Access to Data from the Web Raquel Trillo *, Laura Po +, Sergio Ilarri *, Sonia Bergamaschi + and E. Mena * 1st International Workshop on Interoperability.

Presentation transcript:

Semantic Access to Data from the Web
Raquel Trillo *, Laura Po +, Sergio Ilarri *, Sonia Bergamaschi + and E. Mena *
* Distributed Information Systems Group, University of Zaragoza, Spain
+ Databases Group, Univ. of Modena e Reggio Emilia, Italy
1st International Workshop on Interoperability through Semantic Data and Service Integration (ISDSI'09), Camogli (Genova), Italy, 25th June 2009

Outline
- Introduction.
- Basic Architecture of the system:
  - Discovering the Semantics of User Keywords.
  - Semantics-Guided Data Retrieval.
- Improvements to the Basic Architecture:
  - Probabilistic Word Sense Disambiguation.
  - Retrieval of Synonyms of User Keywords.
- Conclusions and Future Work.

Introduction
- Search engines have become users' best allies.
  - They index most of the non-hidden Web.
  - They succeed when users ask for popular information on the Web.
- Traditional search engines are based on syntactic techniques (no semantics), which causes problems with:
  - Polysemous words: words with several meanings (senses/interpretations). Example: mouse (animal, Mickey Mouse, input device, etc.).
  - Synonymous words: different representations (words) with the same meaning.
    - Example: automobile or car.
    - Example: lorry or truck.

Introduction
[Figure: result counts for two synonymous queries: truck, 172,000,000; lorry, 4,760,000.]

Introduction: Semantic Search
- Semantic search engines can overcome the problems of traditional search engines:
  - They consider the semantics of keywords and not only their representation (how they are written).
- Our proposal:
  - Classify the results of a traditional search into different categories by considering their possible meanings.
  - Consider the synonyms of the user keywords to retrieve more pages.

Introduction: Web Clustering
- Over the last decades, different techniques to cluster documents have appeared:
  - Traditional clustering algorithms cannot be applied to search result clustering.
- Features that a clustering approach for web search should have:
  - Separate the pages relevant to the user from the irrelevant ones.
  - Provide browsable summaries of each cluster.
  - Be applied to snippets and not to whole pages.
  - Be incremental and provide results as soon as possible.
  - Allow overlapping between groups.

Outline
- Introduction.
- Basic Architecture of the system:
  - Discovering the Semantics of User Keywords:
    - Obtaining the possible keyword senses (meanings).
    - Selecting the most probable sense of each user keyword.
  - Semantics-Guided Data Retrieval:
    - Lexical annotations of results of a traditional search.
    - Categorization of results.
- Improvements to the Basic Architecture.
- Conclusions and Future Work.

Basic Architecture of the system
[Figure: architecture diagram with two modules.
 Discovering the Semantics of User Keywords: extraction of keyword senses; disambiguation of keyword senses.
 Semantics-Guided Data Retrieval: search of the keywords in traditional search engines; lexical annotation of hits (title and snippet); categorization of hits; selection of the most probable intended category.
 Data exchanged: keywords; possible keyword senses; selected senses; hits (results of a traditional search engine); hits annotated by considering the possible keyword senses; clusters or categories of hits; semantic cluster of hits.]
- Goal: discover the intended meaning of each user keyword.
- How: a Word Sense Disambiguation algorithm that works in two phases:
  - Phase 1: discover the possible meanings (senses) from semantic resources such as ontologies, thesauri, etc.
  - Phase 2: for each keyword, select one intended meaning by considering the context.

Obtaining the Possible Keyword Senses of each User Keyword
- Consulting a well-known general-purpose shared thesaurus such as WordNet:
  - Advantages: it is fast and provides a reliable set of senses.
  - Disadvantages: it does not cover all domains of knowledge in the same detail. Example: the meaning of developer as "somebody who designs and implements software" does not appear.
- Consulting the knowledge stored in different pools of ontologies available on the Web and using synonym probability measures to remove redundant interpretations:
  - Advantages: the more ontologies consulted, the higher the chance of finding the semantics intended by the user.
  - Disadvantages: it could introduce noise and irrelevant information.

Obtaining the Possible Keyword Senses of each User Keyword
- Option 1: consulting a well-known general-purpose shared thesaurus such as WordNet.
- Option 2: consulting the knowledge stored in different pools of ontologies and using synonym probability measures to remove redundant interpretations.
- The trade-off between the two approaches is not totally clear:
  - Implement both options, beginning with the WordNet one.
  - Perform an experimental evaluation to decide which approach to adopt.
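The two options can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the tiny `THESAURUS` and `ONTOLOGY_POOL` inventories are hand-made stand-ins for WordNet and for pools of Web ontologies, and a Jaccard overlap with an arbitrary threshold stands in for the synonym probability measures, which the slides do not specify.

```python
from typing import Dict, FrozenSet, List

Sense = FrozenSet[str]  # a sense is modelled as a set of synonymous terms

# Hypothetical, hand-made inventories (not real WordNet or ontology data).
THESAURUS: Dict[str, List[Sense]] = {
    "star": [frozenset({"star", "celestial body"}),
             frozenset({"star", "lead", "principal", "actor"})],
}
ONTOLOGY_POOL: Dict[str, List[Sense]] = {
    "star": [frozenset({"star", "actor", "celebrity"}),
             frozenset({"star", "star topology", "network"})],
}


def jaccard(a: Sense, b: Sense) -> float:
    """Overlap between two senses; a stand-in for a synonym probability."""
    return len(a & b) / len(a | b)


def possible_senses(keyword: str,
                    sources: List[Dict[str, List[Sense]]],
                    threshold: float = 0.3) -> List[Sense]:
    """Collect the senses of a keyword from every source, merging a new
    sense into the first known one it overlaps with enough."""
    merged: List[Sense] = []
    for source in sources:
        for sense in source.get(keyword.lower(), []):
            for i, known in enumerate(merged):
                if jaccard(sense, known) >= threshold:
                    merged[i] = known | sense  # redundant interpretation
                    break
            else:
                merged.append(sense)  # genuinely new interpretation
    return merged


for sense in possible_senses("star", [THESAURUS, ONTOLOGY_POOL]):
    print(sorted(sense))
```

Passing only `[THESAURUS]` corresponds to option 1; adding further sources corresponds to option 2, where the merge step keeps the extra ontologies from multiplying near-identical senses.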

Selecting the most probable sense of each User Keyword
[Figure: module "Discovering the Semantics of User Keywords": extraction of keyword senses (output: possible keyword senses), followed by disambiguation of keyword senses (output: selected senses).]
- Goal: select the most probable intended meaning for each user keyword.
- How: using Word Sense Disambiguation techniques:
  - Many features can be considered in the context of a written document, but here the process is more complex.
  - There is no syntax of whole sentences, there are few keywords (<5), etc.

Selecting the most probable sense of each User Keyword
- Try to emulate human behaviour by considering the possible meanings of the remaining keywords:
  - If star appears in the context "Star Hollywood", then the most probable intended meaning is "famous actor/actress".
  - If star appears in the context "Star Sky", then the most probable intended meaning is "celestial body".
- The proposed architecture does not depend on a particular Word Sense Disambiguation technique:
  - Probabilistic Word Sense Disambiguation techniques that combine different algorithms can be used.
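The "star" example can be reproduced with a small Python sketch. Since the architecture does not fix a disambiguation technique, this one uses a simplified overlap heuristic (in the spirit of the Lesk algorithm) purely as an illustration; the `SENSES` table and its related-word sets are invented for the example.

```python
from typing import Dict, List, Set

# Hypothetical sense inventory: keyword -> sense label -> related words.
SENSES: Dict[str, Dict[str, Set[str]]] = {
    "star": {
        "celestial body": {"sky", "night", "astronomy", "sun"},
        "famous actor/actress": {"hollywood", "film", "movie", "celebrity"},
    },
    "hollywood": {"film industry": {"film", "movie", "cinema", "celebrity"}},
    "sky": {"atmosphere": {"night", "cloud", "sun", "astronomy"}},
}


def disambiguate(keyword: str, query: List[str]) -> str:
    """Pick the sense of `keyword` whose related words overlap most with
    the other keywords of the query and with their possible senses."""
    context: Set[str] = set()
    for other in query:
        if other == keyword:
            continue
        context.add(other)
        for related in SENSES.get(other, {}).values():
            context |= related
    scores = {sense: len(related & context)
              for sense, related in SENSES[keyword].items()}
    return max(scores, key=scores.get)


print(disambiguate("star", ["star", "hollywood"]))  # famous actor/actress
print(disambiguate("star", ["star", "sky"]))        # celestial body
```

With so few keywords per query, the context is built from the possible senses of the other keywords rather than from sentence syntax, which is the constraint the slides point out.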

Outline
- Introduction.
- Basic Architecture of the system:
  - Discovering the Semantics of User Keywords:
    - Obtaining the possible keyword senses (meanings).
    - Selecting the most probable sense of each user keyword.
  - Semantics-Guided Data Retrieval:
    - Lexical annotations of results of a traditional search.
    - Categorization of results.
- Improvements to the Basic Architecture.
- Conclusions and Future Work.

Semantics-Guided Data Retrieval
[Figure: module "Semantics-Guided Data Retrieval": search of the keywords in traditional search engines; lexical annotation of hits (title and snippet); categorization of hits; selection of the most probable intended category. Data exchanged: keywords; possible keyword senses; selected senses; hits (results of a traditional search engine); hits annotated by considering the possible keyword senses; clusters or categories of hits; semantic cluster of hits.]
- Goal: select the hits relevant to the user and filter out the irrelevant ones.
- How:
  - Phase 1: retrieval by using traditional techniques.
  - Phases 2 and 3: lexical annotation of the hits and their classification by using Word Sense Disambiguation.
  - Phase 4: selection of the category corresponding to the selected senses.

Lexical Annotation of the Results of a Traditional Search Engine
[Figure: cleansing of hits, followed by lexical annotation. Inputs: possible keyword senses; hits (results of a traditional search engine), with title, URL and snippet for each hit. Output: hits annotated by considering the possible keyword senses.]
- Goal: associate a meaning with each user keyword that appears in each returned hit (title, URL and snippets), chosen among the possible meanings of the keyword.
- How:
  - Cleansing each hit to remove stopwords and marks without semantic information.
  - Performing WSD by considering the context of the words (their neighbouring words in a window).

Lexical Annotation of the Results of a Traditional Search Engine
- Only information from snippets is used to perform the lexical annotation.
- New senses for words appear, but they are integrated into semantic resources only once they are widespread.
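The cleansing and window-based annotation steps can be sketched in Python as follows. This is only a sketch under stated assumptions: the stopword list, the clue words attached to each sense, the window size and the function names `cleanse` and `annotate` are all hypothetical, and counting clue words in the window is a deliberately simple stand-in for a real WSD algorithm.

```python
import re
from typing import Dict, List, Optional, Set

# Hypothetical, very small stopword list.
STOPWORDS: Set[str] = {"the", "a", "an", "of", "in", "and", "is", "to", "for", "on"}


def cleanse(text: str) -> List[str]:
    """Lowercase the text, drop punctuation and digits, remove stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def annotate(hit_text: str, keyword: str,
             sense_clues: Dict[str, Set[str]], window: int = 3) -> Optional[str]:
    """Return the sense of `keyword` in the hit, "unknown" if no clue word
    occurs near it, or None if the keyword does not appear at all."""
    tokens = cleanse(hit_text)
    if keyword not in tokens:
        return None
    pos = tokens.index(keyword)  # first occurrence only, for simplicity
    neighbours = set(tokens[max(0, pos - window):pos]
                     + tokens[pos + 1:pos + 1 + window])
    scores = {sense: len(clues & neighbours)
              for sense, clues in sense_clues.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


STAR_CLUES = {
    "celestial body": {"sky", "telescope", "galaxy"},
    "actor/actress": {"movie", "film", "hollywood", "award"},
}
print(annotate("The star of the new Hollywood movie wins an award", "star", STAR_CLUES))
print(annotate("Star Trek fan site", "star", STAR_CLUES))
```

The "unknown" result matters for the next step, because hits whose keyword sense cannot be determined still need a category of their own.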

Categorization of the Annotated Results
- Goal: associate a category with each annotated hit.
- How:
  - Defining the categories by considering the possible keyword senses.
  - Associating a category with each hit by considering its annotations.
Example:
  Keywords and possible senses:
    K1 (Hollywood): S11
    K2 (Star): S21 (celestial body), S22 (actor/actress)
  Annotated hits: Hit1(S11, S21), Hit2(S11, S22), Hit3(S11, S22), Hit4(S11, ?), ...
  Categories:
    C1(S11, S21): Hit1, ...
    C2(S1U, S21): Hit4, ...
    C3(S11, S22): Hit2, Hit3, ...
    C4(S1U, S22): ...
    C5(S1U, S2U): ...

Categorization of the Annotated Results
- Select the category (cluster) that corresponds to the senses selected for the user keywords.
- The hits of each category are ordered following the ranking returned by the search engine.
Categories (S1U: unknown sense for Hollywood; S2U: unknown sense for star):
  C1(S11, S21): Hit1, ...
  C2(S1U, S21): Hit4, ...
  C3(S11, S22): Hit2, Hit3, ...
  C4(S1U, S22): ...
  C5(S1U, S2U): ...
  C6(S11, S2U): ...
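One way to read the category definition is as the Cartesian product of the possible senses of each keyword, with one extra "unknown" option per keyword; for the Hollywood/star example this gives the six categories listed on the slide. The Python sketch below follows that reading. The names `build_categories` and `categorize`, the `"U"` marker and the extra `HitX` are assumptions made for illustration.

```python
from itertools import product
from typing import Dict, List, Optional, Sequence, Tuple

UNKNOWN = "U"  # marker for "unknown sense" (S1U, S2U on the slides)
Category = Tuple[str, ...]


def build_categories(senses_per_keyword: Sequence[List[str]]) -> List[Category]:
    """One category per combination of senses, unknown sense included."""
    options = [list(senses) + [UNKNOWN] for senses in senses_per_keyword]
    return list(product(*options))


def categorize(annotated_hits: Sequence[Tuple[str, Sequence[Optional[str]]]],
               senses_per_keyword: Sequence[List[str]]) -> Dict[Category, List[str]]:
    """Assign each hit to the category matching its annotations. Hits are
    expected in search-engine rank order, which appending preserves."""
    clusters: Dict[Category, List[str]] = {
        cat: [] for cat in build_categories(senses_per_keyword)
    }
    for hit, annotation in annotated_hits:
        key = tuple(s if s is not None else UNKNOWN for s in annotation)
        clusters[key].append(hit)
    return clusters


# K1 (Hollywood) has one sense, K2 (star) has two.
SENSES = [["S11"], ["S21", "S22"]]
HITS = [
    ("Hit1", ("S11", "S21")),
    ("Hit2", ("S11", "S22")),
    ("Hit3", ("S11", "S22")),
    ("HitX", ("S11", None)),  # hypothetical hit with an undetermined sense of star
]
clusters = categorize(HITS, SENSES)
selected = ("S11", "S22")  # senses chosen by the disambiguation step
print(clusters[selected])  # ['Hit2', 'Hit3']
```

Selecting the most probable intended category is then a dictionary lookup with the tuple of selected senses.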

Problems of the Basic Architecture
- Problem 1: the system selects only the most probable intended category, but the user may be interested in another one.
- Problem 2: sometimes it is very difficult, even for a human, to decide which meaning of a word is being used.
- Problem 3: the system does not consider the synonyms of the keywords.

Outline
- Introduction.
- Basic Architecture of the system:
  - Discovering the Semantics of User Keywords.
  - Semantics-Guided Data Retrieval.
- Improvements to the Basic Architecture:
  - Probabilistic Word Sense Disambiguation.
  - Retrieval of Synonyms of User Keywords.
- Conclusions and Future Work.

Probabilistic Word Sense Disambiguation
- Show more interpretations to the user: instead of showing only the category corresponding to the most probable senses, show all the categories sorted by the probability associated with each category.
Example:
  Original order:
    C1(S11, S21): Hit1, ...
    C2(S1U, S21): Hit4, ...
    C3(S11, S22): Hit2, Hit3, ...
    C4(S1U, S22): ...
    C5(S1U, S2U): ...
    C6(S11, S2U): ...
  Sorted by probability:
    C3(S11, S22): Hit2, Hit3, ...
    C4(S1U, S22): ...
    C1(S11, S21): Hit1, ...
    C2(S1U, S21): Hit4, ...
    C6(S11, S2U): ...
    C5(S1U, S2U): ...

Probabilistic Word Sense Disambiguation
- Probabilistic Word Sense Disambiguation:
  - It is based on a probabilistic combination of different WSD algorithms, so the process does not depend on the effectiveness of a single algorithm.
  - It associates with each lexical annotation a probability that indicates the reliability level of the annotation.
  - So, each hit will be associated with several categories, each with a certain probability.
[Figure: example probabilities 0.75, 0.20, 0.05.]
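The slides do not say how the WSD algorithms are combined, so the Python sketch below makes two explicit assumptions: each algorithm's output is averaged with a reliability weight, and the annotations of different keywords are treated as independent when a hit is spread over categories. The names `combine` and `category_distribution` and all the numbers are hypothetical.

```python
from collections import defaultdict
from itertools import product
from typing import Dict, List, Tuple

Distribution = Dict[str, float]  # sense -> probability


def combine(votes: List[Tuple[float, Distribution]]) -> Distribution:
    """Reliability-weighted average of the sense distributions proposed
    by several WSD algorithms (assumed combination rule)."""
    total = sum(weight for weight, _ in votes)
    combined: Dict[str, float] = defaultdict(float)
    for weight, dist in votes:
        for sense, p in dist.items():
            combined[sense] += weight * p / total
    return dict(combined)


def category_distribution(
        annotations: List[Distribution]) -> List[Tuple[Tuple[str, ...], float]]:
    """Probability of each category for one hit, assuming the annotations
    of different keywords are independent; sorted most probable first."""
    scored = []
    for combo in product(*(a.items() for a in annotations)):
        senses = tuple(sense for sense, _ in combo)
        prob = 1.0
        for _, p in combo:
            prob *= p
        scored.append((senses, prob))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Two hypothetical algorithms annotate "star" in one hit.
star = combine([(0.6, {"S22": 1.0}),
                (0.4, {"S21": 0.5, "S22": 0.5})])
hollywood = {"S11": 1.0}
for category, probability in category_distribution([hollywood, star]):
    print(category, round(probability, 2))
```

Because every hit now carries a distribution instead of a single label, the same hit can appear in several categories, which is the overlapping behaviour listed among the desirable clustering features.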

Retrieval of Synonyms of User Keywords
- Probabilistic Word Sense Disambiguation:
  - Associate with each hit the product of the probabilities of its annotations and use this value to rank the hits classified inside a category (group or cluster).
- Enrichment of the clusters by retrieving synonyms of the senses that represent each category.
Example: C3(S11 (Hollywood), S22 (star)): Hit2, Hit3, ... enriched with the synonyms "celebrity" and "actor/actress".
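Both ideas fit in a few lines of Python. The ranking follows the rule stated on the slide (product of the annotation probabilities); the query expansion is only one possible way to retrieve pages for the synonyms, and the function names, probabilities and synonym list are assumptions made for illustration.

```python
from itertools import product
from math import prod
from typing import Dict, List, Sequence, Tuple


def rank_hits(hits: Sequence[Tuple[str, Sequence[float]]]) -> List[str]:
    """Order the hits of one category by the product of the probabilities
    of their annotations. sorted() is stable, so ties keep the original
    search-engine order."""
    ordered = sorted(hits, key=lambda hit: prod(hit[1]), reverse=True)
    return [hit_id for hit_id, _ in ordered]


def expanded_queries(keywords: Sequence[str],
                     synonyms: Dict[str, List[str]]) -> List[str]:
    """Build one query per combination of a keyword or one of the
    synonyms of its selected sense (assumed expansion strategy)."""
    options = [[kw] + synonyms.get(kw, []) for kw in keywords]
    return [" ".join(combo) for combo in product(*options)]


# Hypothetical annotation probabilities for the hits of one category.
category_hits = [("Hit2", (1.0, 0.6)), ("Hit3", (0.9, 0.9)), ("Hit5", (1.0, 0.6))]
print(rank_hits(category_hits))  # ['Hit3', 'Hit2', 'Hit5']

print(expanded_queries(["hollywood", "star"],
                       {"star": ["celebrity", "actor", "actress"]}))
```

The hits returned for the expanded queries could then be annotated and categorized like any other hit before being added to the cluster.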

Outline
- Introduction.
- Basic Architecture of the system:
  - Discovering the Semantics of User Keywords.
  - Semantics-Guided Data Retrieval.
- Improvements to the Basic Architecture:
  - Probabilistic Word Sense Disambiguation.
  - Retrieval of Synonyms of User Keywords.
- Conclusions and Future Work.

Related Work
- There exist several techniques for clustering the results of a web search, but most of them are based only on statistical techniques.
- Some approaches consider semantics, such as:
  - Hao et al. 2008: uses only WordNet and assumes a predefined set of categories.
  - Hemayati et al. 2007: limited to queries with a single keyword and does not allow overlapping categories.

Conclusions and Future Work
- We have proposed an architecture to group the results of a standard search engine into different categories:
  - The categories are defined by the senses of the input keywords.
  - The system has the features that are desirable in this kind of system.
  - Non-popular searches do not remain hidden.
- Next steps:
  - Implementation of the proposed system.
  - Design of a set of experiments with users to evaluate it.

Semantic Access to Data from the Web Raquel Trillo , Laura Po +, Sergio Ilarri *, Sonia Bergamaschi +, E. Mena * 1st International Workshop on Interoperability through Semantic Data and Service Integration ISDSI’09 Cagmoli (Genova), Italy, 25th June Univ. Zaragoza Univ. Of Modena e Reggio Emilia Grazie Mille! Thank you very much! Questions and suggestions. * +