CS344: Introduction to Artificial Intelligence Vishal Vachhani M.Tech, CSE Lecture 34-35: CLIR and Ranking, Crawling and Indexing in IR.

Road Map
- Cross Lingual IR
  - Motivation
  - CLIA architecture
  - CLIA demo
- Ranking
  - Various ranking methods
  - Nutch/Lucene ranking
  - Learning a ranking function
  - Experiments and results

Cross Lingual IR
- Motivation: information unavailability in some languages; the language barrier
- Definition: cross-language information retrieval (CLIR) is a subfield of information retrieval dealing with retrieving information written in a language different from the language of the user's query (Wikipedia)
- Example: a user may pose a query in Hindi but retrieve relevant documents written in English.

Why CLIR?
[Diagram: a query in Tamil is sent to the cross-lingual document search system, which searches English and Marathi documents and returns English documents after snippet generation and translation.]

Cross Lingual Information Access (CLIA)
- A web portal supporting monolingual and cross-lingual IR in 6 Indian languages and English
- Domain: tourism
- It supports:
  - Summarization of web documents
  - Snippet translation into the query language
  - Template-based information extraction
- The CLIA system is publicly available at

CLIA Demo

Various Ranking Methods
- Vector space model: Lucene, Nutch, Lemur, etc.
- Probabilistic ranking model: classical Spärck Jones ranking (log-odds ratio)
- Language model
- Ranking using machine learning algorithms: SVM, Learning to Rank, SVM-MAP, etc.
- Link-analysis-based ranking: PageRank, Hubs and Authorities, OPIC, etc.

Nutch Ranking
CLIA is built on top of Nutch, an open-source web search engine whose ranking is based on the vector space model.
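The vector space scoring idea behind Lucene/Nutch can be sketched as follows. This is a minimal tf-idf/cosine illustration, not Nutch's actual scoring formula; the toy documents and query are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build a sparse tf-idf vector (term -> weight) for each tokenized document."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))                      # document frequency per term
    idf = {t: math.log(n / df[t]) for t in df}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * idf[t] for t in tf})
    return vectors, idf

def cosine(u, v):
    """Cosine similarity between two sparse vectors represented as dicts."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [
    "cross lingual information retrieval".split(),
    "web search engine ranking".split(),
    "information retrieval ranking".split(),
]
vecs, idf = tfidf_vectors(docs)

query = "information retrieval".split()
qtf = Counter(query)
qvec = {t: qtf[t] * idf.get(t, 0.0) for t in qtf}

# Rank document ids by similarity to the query
scores = sorted(range(len(docs)), key=lambda i: cosine(qvec, vecs[i]), reverse=True)
print(scores)  # most similar document id first
```

Document 2 ranks first because both of its matching terms carry weight relative to its short length, while document 1 shares no terms with the query.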

Link Analysis
- Calculates the importance of pages using the web graph
  - Nodes: pages
  - Edges: hyperlinks between pages
- Motivation: link-analysis-based scores are hard to manipulate using spamming techniques
- Plays an important role in the web IR scoring function
  - PageRank
  - Hubs and Authorities
  - Online Page Importance Computation (OPIC)
- The link analysis score is used along with the tf-idf-based score; CLIA uses the OPIC score as a factor.
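As a concrete instance of link-analysis scoring, here is a minimal power-iteration PageRank over an in-memory link graph. The graph itself is a made-up example; OPIC, which CLIA actually uses, computes importance online during the crawl rather than by batch iteration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively compute PageRank over a dict mapping page -> list of outlinks."""
    pages = set(links) | {p for outs in links.values() for p in outs}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            outs = links.get(page, [])
            if outs:
                share = damping * rank[page] / len(outs)
                for target in outs:          # each outlink receives an equal share
                    new[target] += share
            else:
                for target in pages:         # dangling page: spread rank everywhere
                    new[target] += damping * rank[page] / n
        rank = new
    return rank

# Hypothetical tiny web graph: pages "a" and "b" both link to "hub"
graph = {"a": ["hub"], "b": ["hub"], "hub": ["a"]}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))  # the most-linked page wins
```

Because "hub" receives links from two pages, it ends up with the highest score, which is exactly why link-based scores are harder to spam than purely content-based ones.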

Learning a Ranking Function
- How much weight should be given to different parts of a web document when ranking?
- A ranking function can be learned with machine learning algorithms such as SVM or maximum entropy.
- Training:
  - A set of queries, with some relevant and some non-relevant documents for each query
  - A set of features capturing the similarity of each document to the query
  - In short, learn the optimal weight of each feature
- Ranking:
  - Use the trained model to generate a score by combining the feature scores of each document in which the query words appear
  - Sort the documents by score and display them to the user
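The train-then-rank procedure above can be sketched with a simple pairwise perceptron instead of SVM or maximum entropy: whenever a non-relevant document scores at least as high as a relevant one, nudge the feature weights toward the relevant document. The three features and the training pairs are hypothetical.

```python
def train_pairwise(pairs, n_features, epochs=20, lr=0.1):
    """Learn a weight per feature from (relevant_vec, nonrelevant_vec) pairs,
    pushing score(relevant) above score(nonrelevant)."""
    w = [0.0] * n_features
    for _ in range(epochs):
        for rel, non in pairs:
            s_rel = sum(wi * xi for wi, xi in zip(w, rel))
            s_non = sum(wi * xi for wi, xi in zip(w, non))
            if s_rel <= s_non:               # ranked the wrong way: update weights
                for i in range(n_features):
                    w[i] += lr * (rel[i] - non[i])
    return w

def score(w, features):
    """Final document score: a weighted combination of its feature scores."""
    return sum(wi * xi for wi, xi in zip(w, features))

# Hypothetical 3-feature vectors: (title match, body tf-idf, OPIC score)
training = [
    ([1.0, 0.8, 0.5], [0.0, 0.6, 0.4]),
    ([1.0, 0.4, 0.7], [0.2, 0.5, 0.1]),
]
w = train_pairwise(training, 3)
ranked = sorted(training[0], key=lambda f: f)  # noqa: illustration only
print(score(w, [1.0, 0.8, 0.5]) > score(w, [0.0, 0.6, 0.4]))  # True after training
```

Sorting documents by `score(w, features)` then yields the final ranked list shown to the user.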

Extended Features for Web IR
1. Content-based features: tf, idf, length, coord, etc.
2. Link-analysis-based features: OPIC score; domain-based OPIC score
3. Standard IR algorithm-based features: BM25 score; Lucene score; LM-based score
4. Language-category-based features: named entities; phrase-based features

Content based Features

Details of features

Feature No | Description
1          | Length of body
2          | Length of title
3          | Length of URL
4          | Length of anchor
5-14       | C1-C10 for the title of the page
15-24      | C1-C10 for the body of the page
25-34      | C1-C10 for the URL of the page
35-44      | C1-C10 for the anchor of the page
45         | OPIC score
46         | Domain-based classification score

Details of features (cont.)

Feature No | Description
48         | BM25 score
49         | Lucene score
50         | Language modeling score
           | Named entity weight for title, body, anchor, URL
55-58      | Multi-word weight for title, body, anchor, URL
59-62      | Phrasal score for title, body, anchor, URL
63-66      | Coord factor for title, body, anchor, URL
71         | Coord factor for the H1 tag of the web document

Experiments and Results (MAP)
- Nutch ranking
- DIR with title + content
- DIR with URL + content
- DIR with title + URL + content
- DIR with title + URL + content + anchor
- DIR with title + URL + content + anchor + NE feature

Crawling, Indexing

Outline
- Nutch overview
- Crawler in the CLIA system
  - Data structures
- Indexing
  - Types of index and indexing tools
- Searching
  - Command-line API
  - Searching through the GUI
- Demo

Crawler
The crawler system is driven by the Nutch crawl tool and a family of related tools that build and maintain several types of data structures: the web database, a set of segments, and the index.

Crawler Data Structures
- Web database (webdb)
   Persistent data structure mirroring the web graph being crawled
   Stores pages and links
- Segment
   A collection of pages fetched and indexed by the crawler in a single run
- Index
   Inverted index of all of the pages the system has retrieved

Crawler
[Diagram: the Injector seeds the initial URLs into the CrawlDB; the Generator reads the CrawlDB and generates a fetch list; the Fetcher downloads web pages/files into a Segment; the Parser extracts text and links; and the CrawlDB tool updates the CrawlDB with the results.]

Crawl command
- Aimed at intranet-scale crawling
- A front end to the other, lower-level tools; it performs both crawling and indexing
- Create a URLS directory and put the list of URLs in it.
- Command
   $NUTCH_HOME/bin/nutch crawl urlDir [Options]
   Options
    -dir: the directory to put the crawl in
    -depth: the link depth from the root page that should be crawled
    -threads: the number of threads that will fetch in parallel
    -topN: the number of total pages to be crawled
- Example
   bin/nutch crawl urls -dir crawldir -depth 3 -topN 10
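The crawl loop that the command drives is essentially a breadth-first traversal bounded by -depth and -topN. The toy sketch below runs over a hypothetical in-memory link graph instead of fetching real URLs over HTTP:

```python
from collections import deque

def crawl(seed_urls, link_graph, depth=3, top_n=10):
    """Breadth-first crawl over an in-memory link graph, honoring
    -depth (link distance from the seeds) and -topN (total pages) limits."""
    fetched = []
    seen = set(seed_urls)
    frontier = deque((url, 0) for url in seed_urls)
    while frontier and len(fetched) < top_n:
        url, d = frontier.popleft()
        fetched.append(url)                       # "fetch" the page
        if d + 1 > depth:
            continue                              # too deep: do not expand outlinks
        for out in link_graph.get(url, []):       # "parse" the page's outlinks
            if out not in seen:
                seen.add(out)
                frontier.append((out, d + 1))
    return fetched

# Hypothetical site graph
graph = {
    "http://site/": ["http://site/a", "http://site/b"],
    "http://site/a": ["http://site/c"],
}
print(crawl(["http://site/"], graph, depth=2, top_n=3))
```

With `top_n=3` the crawl stops after three pages, so `http://site/c` is discovered but never fetched; in Nutch this generate/fetch/update cycle repeats once per depth level.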

Inject command
- Injects root URLs into the web database
- Command
   $NUTCH_HOME/bin/nutch inject <crawldb> <url_dir>
   <crawldb>: path to the crawl database directory
   <url_dir>: path to the directory containing flat text URL files

Generate command
- Generates a new fetcher segment from the crawl database
- Command:
   $NUTCH_HOME/bin/nutch generate <crawldb> <segments_dir> [-topN <N>] [-numFetchers <numFetchers>]
   <crawldb>: path to the crawldb directory
   <segments_dir>: path to the directory where the fetcher segments are created
   [-topN <N>]: selects the top <N> ranking URLs for this segment
   [-numFetchers <numFetchers>]: the number of fetch partitions

Fetch command
- Runs the fetcher on a segment
- Command:
   $NUTCH_HOME/bin/nutch fetch <segment> [-threads <n>] [-noParsing]
   <segment>: path to the segment to fetch
   [-threads <n>]: the number of fetcher threads to run
   [-noParsing]: disables automatic parsing of the segment's data

Parse command
- Runs ParseSegment on a segment
- Command
   $NUTCH_HOME/bin/nutch parse <segment>
   <segment>: path to the segment to parse

Updatedb command
- Updates the crawl database with information obtained from the fetcher
- Command:
   $NUTCH_HOME/bin/nutch updatedb <crawldb> <segment>
   <crawldb>: path to the crawl database
   <segment>: path to the segment that has been fetched

Index and Indexing
- Sequential search is bad (not scalable)
- Indexing: the creation of a data structure that facilitates fast, random access to the information stored in it
- Types of index
   Forward index
   Inverted index
   Full inverted index

Forward Index
It stores a list of words for each document.
Example:
D1 = "it is what it is."
D2 = "what is it."
D3 = "it is a banana"

Document | Words
1        | it, is, what
2        | what, is, it
3        | it, is, a, banana

Inverted Index
It stores a list of documents for each word.

Word   | Documents
a      | 3
banana | 3
is     | 1, 2, 3
it     | 1, 2, 3
what   | 1, 2

Full Inverted Index
It stores (document, position) pairs and is used to support phrase search.
Query: "what is it"

Word   | (Document, Position) pairs
a      | {(3,2)}
banana | {(3,3)}
is     | {(1,1), (1,4), (2,1), (3,1)}
it     | {(1,0), (1,3), (2,2), (3,0)}
what   | {(1,2), (2,0)}
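The full inverted index and the phrase search it enables can be sketched directly from the three example documents: a word maps to (document, position) pairs, and a phrase matches wherever its words occur at consecutive positions.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Full inverted index: word -> list of (doc_id, position) pairs."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].append((doc_id, pos))
    return index

def phrase_search(index, phrase):
    """Return the ids of documents containing the phrase's words
    at consecutive positions."""
    words = phrase.lower().split()
    if not words:
        return set()
    # Candidate (doc, start_position) pairs come from the first word
    results = set(index.get(words[0], []))
    for offset, word in enumerate(words[1:], start=1):
        postings = set(index.get(word, []))
        # Keep only starts whose next word appears exactly `offset` later
        results = {(d, p) for d, p in results if (d, p + offset) in postings}
    return {d for d, _ in results}

docs = {1: "it is what it is", 2: "what is it", 3: "it is a banana"}
index = build_positional_index(docs)
print(phrase_search(index, "what is it"))  # only document 2 matches
```

A plain inverted index would return documents 1 and 2 for these query words; the positions are what rule out document 1, where "what is it" never occurs contiguously.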

Invertlinks command
- Updates the link database with linking information from a segment
- Command:
   $NUTCH_HOME/bin/nutch invertlinks <linkdb> (-dir <segments_dir> | <segment1> <segment2> ...)
   <linkdb>: path to the link database
   <segment>: path to a segment that has been fetched; a directory or more than one segment may be specified

Index command
- Creates an index of a segment, using information from the crawldb and the linkdb to score pages in the index
- Command:
   $NUTCH_HOME/bin/nutch index <index> <crawldb> <linkdb> <segment> ...
   <index>: path to the directory where the index will be created
   <crawldb>: path to the crawl database directory
   <linkdb>: path to the link database directory
   <segment>: path to a segment that has been fetched; more than one segment may be specified

Dedup command
- Removes duplicate pages from a set of segment indexes
- Command:
   $NUTCH_HOME/bin/nutch dedup <indexes>
   <indexes>: path to the directories containing the indexes

Merge command
- Merges several segment indexes
- Command:
   $NUTCH_HOME/bin/nutch merge <output_index> <indexes> ...
   <output_index>: path to a directory where the merged index will be created
   <indexes>: path to a directory containing indexes to merge; more than one directory may be specified

Configuring the CLIA crawler
Configuration file: $NUTCH/conf/nutch-site.xml
 Required user parameters
  http.agent.name
  http.agent.description
  http.agent.url
  http.agent.
 Optional user parameters
  http.proxy.host
  http.proxy.port

Configuring the CLIA crawler
Configuration file: $NUTCH/conf/crawl-urlfilter.txt
- Regular expressions that filter URLs during crawling, e.g.:
   To ignore files with certain suffixes: -\.(gif|exe|zip|ico)$
   To accept hosts in a certain domain: +^
- To crawl everything else, change the rule under "# skip everything else" (line 26 of crawl-urlfilter.txt) to: +.
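The filter semantics above (first matching rule wins; "-" rejects, "+" accepts) can be sketched as follows. This is a toy mirror of the crawl-urlfilter.txt behavior, not Nutch's actual plugin; the example.com domain rule stands in for the elided domain pattern.

```python
import re

# Ordered rules as in crawl-urlfilter.txt: the first matching pattern decides.
RULES = [
    ("-", re.compile(r"\.(gif|exe|zip|ico)$")),                    # skip binaries/images
    ("+", re.compile(r"^http://([a-z0-9]*\.)*example\.com/")),     # accept one domain
    ("-", re.compile(r".")),                                       # skip everything else
]

def url_allowed(url):
    """Apply the rules in order; '+' accepts, '-' rejects, no match rejects."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False

print(url_allowed("http://www.example.com/index.html"))  # True: domain rule accepts
print(url_allowed("http://www.example.com/logo.gif"))    # False: suffix rule rejects
print(url_allowed("http://other.org/page.html"))         # False: catch-all rejects
```

Replacing the final `("-", ...)` rule with `("+", re.compile(r"."))` corresponds to the "+." edit described above, which lets the crawler follow any URL not explicitly rejected.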

Searching and Indexing
[Diagram: the Indexer (Lucene) builds the Index from the Segments, the CrawlDB, and the LinkDB; the Searcher (Lucene) answers queries against the Index through the GUI (Tomcat).]

Crawl Directory Structure
- crawldb
   Contains information about every URL known to Nutch
- linkdb
   Contains the list of known links to each URL
- segments
   crawl_generate: names a set of URLs to be fetched
   crawl_fetch: contains the status of fetching each URL
   content: contains the content of each URL
   parse_text: contains the parsed text of each URL
   parse_data: contains outlinks and metadata parsed from each URL
   crawl_parse: contains the outlink URLs, used to update the crawldb
- index
   Contains Lucene-format indexes

Searching
- Configuration file: $NUTCH/conf/nutch-default.xml
- Change the following property:
   searcher.dir: the complete path to your crawl folder
- Command-line searching API:
   $NUTCH_HOME/bin/nutch org.apache.nutch.searcher.NutchBean queryString

Searching
- Create the clia-alpha-test.war file using "ant war"
- Deploy the clia-alpha-test.war file in the Tomcat webapps directory

Thanks