CS344: Introduction to Artificial Intelligence Vishal Vachhani M.Tech, CSE Lecture 34-35: CLIR and Ranking in IR.

Slides:



Advertisements
Similar presentations
Improvements and extras Paul Thomas CSIRO. Overview of the lectures 1.Introduction to information retrieval (IR) 2.Ranked retrieval 3.Probabilistic retrieval.
Advertisements

Chapter 5: Introduction to Information Retrieval
Learning to Cluster Web Search Results SIGIR 04. ABSTRACT Organizing Web search results into clusters facilitates users quick browsing through search.
Query Dependent Pseudo-Relevance Feedback based on Wikipedia SIGIR ‘09 Advisor: Dr. Koh Jia-Ling Speaker: Lin, Yi-Jhen Date: 2010/01/24 1.
A New Suffix Tree Similarity Measure for Document Clustering Hung Chim, Xiaotie Deng City University of Hong Kong WWW 2007 Session: Similarity Search April.
Chapter 7 Retrieval Models.
Information Retrieval Ling573 NLP Systems and Applications April 26, 2011.
6/16/20151 Recent Results in Automatic Web Resource Discovery Soumen Chakrabartiv Presentation by Cui Tao.
Semantic text features from small world graphs Jure Leskovec, IJS + CMU John Shawe-Taylor, Southampton.
Mehran Sahami Timothy D. Heilman A Web­based Kernel Function for Measuring the Similarity of Short Text Snippets.
1 LM Approaches to Filtering Richard Schwartz, BBN LM/IR ARDA 2002 September 11-12, 2002 UMASS.
ITCS 6010 Natural Language Understanding. Natural Language Processing What is it? Studies the problems inherent in the processing and manipulation of.
Vector Space Model CS 652 Information Extraction and Integration.
Sigir’99 Inside Internet Search Engines: Search Jan Pedersen and William Chang.
1 CS 430 / INFO 430 Information Retrieval Lecture 10 Probabilistic Information Retrieval.
INEX 2003, Germany Searching in an XML Corpus Using Content and Structure INEX 2003, Germany Yiftah Ben-Aharon, Sara Cohen, Yael Grumbach, Yaron Kanza,
Online Learning for Web Query Generation: Finding Documents Matching a Minority Concept on the Web Rayid Ghani Accenture Technology Labs, USA Rosie Jones.
Information Retrieval
Chapter 5: Information Retrieval and Web Search
Overview of Search Engines
Modeling (Chap. 2) Modern Information Retrieval Spring 2000.
CS344: Introduction to Artificial Intelligence Vishal Vachhani M.Tech, CSE Lecture 34-35: CLIR and Ranking, Crawling and Indexing in IR.
MediaEval Workshop 2011 Pisa, Italy 1-2 September 2011.
TREC 2009 Review Lanbo Zhang. 7 tracks Web track Relevance Feedback track (RF) Entity track Blog track Legal track Million Query track (MQ) Chemical IR.
TERM IMPACT- BASED WEB PAGE RAKING School of Electrical Engineering and Computer Science Falah Al-akashi and Diana Inkpen
CC P ROCESAMIENTO M ASIVO DE D ATOS O TOÑO 2015 Lecture 8: Information Retrieval II Aidan Hogan
1 Cross-Lingual Query Suggestion Using Query Logs of Different Languages SIGIR 07.
Improving Web Spam Classification using Rank-time Features September 25, 2008 TaeSeob,Yun KAIST DATABASE & MULTIMEDIA LAB.
Chapter 2 Architecture of a Search Engine. Search Engine Architecture n A software architecture consists of software components, the interfaces provided.
Web Search. Structure of the Web n The Web is a complex network (graph) of nodes & links that has the appearance of a self-organizing structure  The.
Topical Crawlers for Building Digital Library Collections Presenter: Qiaozhu Mei.
Xiaoying Gao Computer Science Victoria University of Wellington Intelligent Agents COMP 423.
Autumn Web Information retrieval (Web IR) Handout #0: Introduction Ali Mohammad Zareh Bidoki ECE Department, Yazd University
Chapter 6: Information Retrieval and Web Search
Search Engines. Search Strategies Define the search topic(s) and break it down into its component parts What terms, words or phrases do you use to describe.
IIIT Hyderabad’s CLIR experiments for FIRE-2008 Sethuramalingam S & Vasudeva Varma IIIT Hyderabad, India 1.
Introduction to Digital Libraries hussein suleman uct cs honours 2003.
Contextual Ranking of Keywords Using Click Data Utku Irmak, Vadim von Brzeski, Reiner Kraft Yahoo! Inc ICDE 09’ Datamining session Summarized.
LANGUAGE MODELS FOR RELEVANCE FEEDBACK Lee Won Hee.
1 Opinion Retrieval from Blogs Wei Zhang, Clement Yu, and Weiyi Meng (2007 CIKM)
Personalization with user’s local data Personalizing Search via Automated Analysis of Interests and Activities 1 Sungjick Lee Department of Electrical.
An Iterative Approach to Extract Dictionaries from Wikipedia for Under-resourced Languages G. Rohit Bharadwaj Niket Tandon Vasudeva Varma Search and Information.
Probabilistic Latent Query Analysis for Combining Multiple Retrieval Sources Rong Yan Alexander G. Hauptmann School of Computer Science Carnegie Mellon.
Search Engine and SEO Presented by Yanni Li. Various Components of Search Engine.
ProjFocusedCrawler CS5604 Information Storage and Retrieval, Fall 2012 Virginia Tech December 4, 2012 Mohamed M. G. Farag Mohammed Saquib Khan Prasad Krishnamurthi.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Improving the performance of personal name disambiguation.
CC P ROCESAMIENTO M ASIVO DE D ATOS O TOÑO 2014 Aidan Hogan Lecture IX: 2014/05/05.
Post-Ranking query suggestion by diversifying search Chao Wang.
The World Wide Web. What is the worldwide web? The content of the worldwide web is held on individual pages which are gathered together to form websites.
Intelligent Database Systems Lab Presenter: CHANG, SHIH-JIE Authors: Longzhuang Li, Yi Shang, Wei Zhang 2002.ACM. Improvement of HITS-based Algorithms.
Sudhanshu Khemka.  Treats each document as a vector with one component corresponding to each term in the dictionary  Weight of a component is calculated.
26/01/20161Gianluca Demartini Ranking Categories for Faceted Search Gianluca Demartini L3S Research Seminars Hannover, 09 June 2006.
1 CS 430: Information Discovery Lecture 5 Ranking.
Artificial Intelligence Techniques Internet Applications 4.
Xiaoying Gao Computer Science Victoria University of Wellington COMP307 NLP 4 Information Retrieval.
Autumn Web Information retrieval (Web IR) Handout #14: Ranking Based on Click Through data Ali Mohammad Zareh Bidoki ECE Department, Yazd University.
CS791 - Technologies of Google Spring A Web­based Kernel Function for Measuring the Similarity of Short Text Snippets By Mehran Sahami, Timothy.
PAIR project progress report Yi-Ting Chou Shui-Lung Chuang Xuanhui Wang.
Vertical Search for Courses of UIUC Homepage Classification The aim of the Course Search project is to construct a database of UIUC courses across all.
CSCE 590 Web Scraping – Information Extraction II
Designing Cross-Language Information Retrieval System using various Techniques of Query Expansion and Indexing for Improved Performance  Hello everyone,
Clustering of Web pages
Aidan Hogan CC Procesamiento Masivo de Datos Otoño 2017 Lecture 7: Information Retrieval II Aidan Hogan
Aidan Hogan CC Procesamiento Masivo de Datos Otoño 2018 Lecture 7 Information Retrieval: Ranking Aidan Hogan
Wikitology Wikipedia as an Ontology
CSE 635 Multimedia Information Retrieval
Chapter 5: Information Retrieval and Web Search
Junghoo “John” Cho UCLA
Information Retrieval and Web Design
Introduction to Search Engines
Presentation transcript:

CS344: Introduction to Artificial Intelligence Vishal Vachhani M.Tech, CSE Lecture 34-35: CLIR and Ranking in IR

Road Map Cross Lingual IR Motivation CLIA architecture CLIA demo Ranking Various Ranking methods Nutch/lucene Ranking Learning a ranking function Experiments and results

Cross Lingual IR Motivation Information unavailability in some languages Language barrier Definition: Cross-language information retrieval (CLIR) is a subfield of information retrieval dealing with retrieving information written in a language different from the language of the user's query (wikipedia) Example: A user may ask query in Hindi but retrieve relevant documents written in English.

4 Why CLIR? Query in Tamil English Document System Marathi Document search English Document Snippet Generation and Translation

Cross Lingual Information Access Cross Lingual Information Access (CLIA) A web portal supporting monolingual and cross lingual IR in 6 Indian languages and English Domain : Tourism It supports : Summarization of web documents Snippet translation into query language Temple based information extraction The CLIA system is publicly available at

CLIA Demo

Various Ranking methods Vector Space Model Lucene, Nutch, Lemur, etc Probabilistic Ranking Model Classical spark John’s ranking (Log ODD ratio) Language Model Ranking using Machine Learning Algo SVM, Learn to Rank, SVM-Map, etc Link analysis based Ranking Page Rank, Hubs and Authorities, OPIC, etc

Nutch Ranking CLIA is built on top on Nutch – A open source web search engine. It is based on Vector space model

Link analysis Calculates the importance of the pages using web graph Node: pages Edge: hyperlinks between pages Motivation: link analysis based score is hard to manipulate using spamming techniques Plays an important role in web IR scoring function Page rank Hub and Authority Online Page Importance Computation (OPIC) Link analysis score is used along with the tf-idf based score We use OPIC score as a factor in CLIA.

Learning a ranking function How much weight should be given to different part of the web documents while ranking the documents? A ranking function can be learned using following method Machine learning algorithms: SVM, Max-entropy Training A set of query and its some relevant and non-relevant docs for each query A set of features to capture the similarity of docs and query In short, learn the optimal value of features Ranking Use a Trained model and generate score by combining different feature score for the documents set where query words appears Sort the document by using score and display to user

Extended Features for Web IR 1. Content based features – Tf, IDF, length, co-ord, etc 2. Link analysis based features – OPIC score – Domains based OPIC score 3. Standard IR algorithm based features – BM25 score – Lucene score – LM based score 4. Language categories based features – Named Entity – Phrase based features

Content based Features

Details of features Feature NoDescriptions 1Length of body 2length of title 3length of URL 4length of Anchor 5-14C1-C10 for Title of the page 15-24C1-C10 for Body of the page 25-34C1-C10 for URL of the page 35-44C1-C10 for Anchor of the page 45OPIC score 46Domain based classification score

Details of features(Cont) Feature NoDescriptions 48BM25 Score 49Lucene score 50Language Modeling score Named entity weight for title, body, anchor, url 55-58Multi-word weight for title, body, anchor, url 59-62Phrasal score for title, body, anchor, url 63-66Co-ord factor for title, body, anchor, url 71Co-ord factor for H1 tag of web document

Experiments and results MAP Nutch Ranking DIR with Title + content DIR with URL+ content DIR with Title + URL + content DIR with Title+URL+content+anchor DIR with Title+URL+ content + anchor+ NE feature

Thanks