TERM IMPACT-BASED WEB PAGE RANKING. School of Electrical Engineering and Computer Science. Falah Al-akashi and Diana Inkpen, 2014. 1

Background, Challenges, Goal, Contributions, Our Approach, Query Processing, Evaluation, Conclusion. 2

TREC Web track (the ClueWeb09 collection): one billion pages in 10 languages; 25 TB uncompressed, 5 TB compressed; 500,000,000 English pages (subset A); 50,000,000 pages (subset B). Submitted systems: Microsoft Research, Yahoo, Google, University of Glasgow, University of Waterloo, University of Ottawa, University of Delaware, University of California, University of Maryland, University of Twente, Carnegie Mellon University, University of Melbourne, University of Amsterdam, York University, University of Otago, University of Massachusetts, Queensland University group, Chinese Academic group, Hungarian Academic group, Centrum Wiskunde & Informatica, University of Dublin, University of London, SIFT Project, etc. 3

The huge growth of the Internet since 1995, with millions of new pages. Lack of clear topic boundaries in most Web documents. Lack of clear topic boundaries in most user queries. Many relevant topics appear only as subtopics, or are semantically similar to other topics in the same documents. Search results cannot satisfy all users' points of view. Spam documents degrade Web search engine results. Home-page and entity-finding queries require extra effort and different algorithms than regular search algorithms. 4

IR is different from Web IR because the Web environment is dynamic and highly diverse: information is often added, updated, or becomes unavailable. The Web keeps growing and becoming more complex; similarly, queries become more complex, too. Some sites have no credibility in their content. A few popular sites provide connectivity and engagement between popular sites in a social manner, e.g., Wikipedia. Wikipedia seeks to create a summary of all human knowledge in the form of an online encyclopedia. Wikipedia intends only to convey knowledge that is already established, recognized, and rarely changed. Content in Wikipedia is subject to the copyright laws of the United States. Wikipedia is the sixth-most-popular website worldwide according to Alexa Internet, receiving more than 2.7 billion U.S. page views every month. 5

Improving retrieval effectiveness from Web data. Exploiting the query structure. Adopting an index structure capable of retrieving results for different types of queries. We proposed a novel, centralized kind of index structure that exploits the human knowledge accumulated and integrated in Wikipedia for indexing Web content. We showed the importance of term impact for document weighting over other document measures (e.g., tf, tf-idf). We proposed alternative ways of query normalization and expansion using Wikipedia. 6

We proposed a collection of phrasal indexing algorithms suitable for queries of any length and any type. We showed the correlation between topics available in different Wikipedia articles. We proposed a novel search model that adapts a global server to run locally on one computer. We proposed a search model able to index and answer queries fast. 7

Our Index Structure 8

Using home pages (from Wikipedia external links). Using other relevant pages (from Wikipedia external references). Using the connectivity between documents for query expansion. Finding related topics for queries that are difficult to index, e.g., "to be or not to be, that is the question". 9
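
Home-page candidates are taken from the external links of Wikipedia articles. As a rough illustration only (not the authors' code), the sketch below pulls external-link URLs out of raw wikitext with a simple regular expression; the sample string and function name are assumptions.

    import re

    # Matches bracketed external links in wikitext, e.g. [http://example.com Example site]
    EXTERNAL_LINK = re.compile(r"\[(https?://[^\s\]]+)[^\]]*\]")

    def external_links(wikitext):
        """Return the external-link URLs found in a Wikipedia article's wikitext."""
        return EXTERNAL_LINK.findall(wikitext)

    sample = "Official site: [http://www.dianaaward.org.uk The Diana Award]"
    print(external_links(sample))   # ['http://www.dianaaward.org.uk']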

10% of the English repository (subset B), about 5 million documents. ~50% of the documents share the same content but are titled differently. ~50% of the documents are article types, while the others are short definitions. Our indexing removed the short articles (using a threshold) and grouped similar and long articles by: 1. using CRC; 2. using common tags; 3. using term impact (for retrieving initial results). Titles, terms, external links, and other related text, such as anchor text, are indexed in the Wikipedia index class. 10
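
A minimal sketch of the CRC-based grouping mentioned above, assuming the checksum is taken over lower-cased, whitespace-normalized article text; the normalization and names are my assumptions, not the original implementation.

    import zlib
    from collections import defaultdict

    def content_crc(text):
        """CRC32 of lower-cased, whitespace-normalized article text."""
        normalized = " ".join(text.lower().split())
        return zlib.crc32(normalized.encode("utf-8"))

    def group_duplicates(docs):
        """Group documents that share identical content but possibly different titles."""
        groups = defaultdict(list)
        for doc_id, text in docs.items():
            groups[content_crc(text)].append(doc_id)
        return [ids for ids in groups.values() if len(ids) > 1]

    docs = {"d1": "Lipoma  is a benign tumor.", "d2": "lipoma is a benign tumor.", "d3": "Unrelated text."}
    print(group_duplicates(docs))   # [['d1', 'd2']]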

1. Using domain names: indexing all terms available in the main domains. Single word, e.g., "diana": main domains diana.com, diana.gov, diana.org, diana.edu, diana.(country code); subdomains diana.???.com, diana.???.gov, diana.???.org, diana.???.edu, diana.???.(country code). Two words or more, e.g., "princess diana": princess.diana.??, diana.princess.??, etc. All terms in the titles that refer to the domains above have been indexed. 2. Using Wikipedia external links. 11
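
A sketch of the candidate-domain generation described on this slide: single-word queries map to main domains and subdomain patterns, multi-word queries to joined and reversed word orders. The suffix list, the "???" placeholders, and the function name are taken from the slide's examples or assumed; this is an illustration, not the original code.

    SUFFIXES = ["com", "org", "gov", "edu", "net"]

    def candidate_domains(query):
        """Generate candidate homepage domains for a query, following the slide's patterns."""
        words = query.lower().split()
        candidates = []
        if len(words) == 1:
            w = words[0]
            candidates += [f"{w}.{s}" for s in SUFFIXES]          # diana.com, diana.org, ...
            candidates += [f"{w}.???.{s}" for s in SUFFIXES]      # subdomain placeholders
        else:
            joined = ".".join(words)                              # princess.diana
            reversed_ = ".".join(reversed(words))                 # diana.princess
            for base in (joined, reversed_):
                candidates += [f"{base}.{s}" for s in SUFFIXES]
        return candidates

    print(candidate_domains("diana")[:3])            # ['diana.com', 'diana.org', 'diana.gov']
    print(candidate_domains("princess diana")[:2])   # ['princess.diana.com', 'princess.diana.org']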

1. The abbreviations or combined terms used in the URLs can be recognized from the keywords in the titles. 2. Segmenting the titles of documents into phrases at the function words "or", "and", "at", "in", "on", "by", "with", "from", or "for", or at punctuation characters, i.e., ":", "|", "(", ")", "-", ",", or "&". 3. Measuring the impact of the phrases in their document's content. 4. Phrases with a high impact score were used for building and naming the index nodes; otherwise they were discarded (threshold). 5. The impact of each phrase is computed using the cosine similarity between two vectors: the first vector is the extracted phrase; the second vector is the document content. 12
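
A rough sketch of steps 2 and 5: titles are split at the listed function words and punctuation, and each resulting phrase is scored by cosine similarity between its term vector and the document's term-frequency vector. Tokenization and weighting details here are my assumptions, not the paper's exact formula.

    import math
    import re
    from collections import Counter

    SPLIT = re.compile(r"\b(?:or|and|at|in|on|by|with|from|for)\b|[:|()\-,&]", re.IGNORECASE)

    def segment_title(title):
        """Split a document title into candidate phrases at function words and punctuation."""
        return [p.strip() for p in SPLIT.split(title) if p.strip()]

    def cosine(a, b):
        dot = sum(a[t] * b.get(t, 0) for t in a)
        norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    def phrase_impacts(title, content):
        doc_vec = Counter(content.lower().split())
        return {phrase: cosine(Counter(phrase.lower().split()), doc_vec)
                for phrase in segment_title(title)}

    print(phrase_impacts("Diana Inkpen | Natural Language Processing at uOttawa",
                         "Diana Inkpen works on natural language processing"))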

1. Terms in the URLs and titles refer to the document's keywords. 2. At least one term from each user query is shared with the keywords above, whereas the other terms are available in the document's content. Example index entries: site, [Impact, t1,f1; t2,f2; ...; tn,fn]; uottawa, [Impact, t1,f1; t2,f2; ...; tn,fn]; diana, [Impact, t1,f1; t2,f2; ...; tn,fn]; inkpen, [Impact, t1,f1; t2,f2; ...; tn,fn]. 13
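
One minimal way to represent such an entry and test the matching rule above; the keywords, impact values, and term lists below are invented for illustration, not taken from the system.

    from collections import namedtuple

    IndexEntry = namedtuple("IndexEntry", ["impact", "term_freqs"])   # term_freqs: {term: frequency}

    index = {
        "uottawa": IndexEntry(impact=0.82, term_freqs={"university": 14, "ottawa": 20, "research": 6}),
        "inkpen":  IndexEntry(impact=0.77, term_freqs={"diana": 9, "nlp": 12, "professor": 3}),
    }

    def matches(query, keyword):
        """True when one query term equals the keyword and the rest appear in the term list."""
        terms = query.lower().split()
        entry = index.get(keyword)
        if entry is None or keyword not in terms:
            return False
        return all(t == keyword or t in entry.term_freqs for t in terms)

    print(matches("diana inkpen", "inkpen"))   # True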

1. Not all documents hold important keywords in the URLs and/or titles. 2. Some documents hold keywords only in the content (subtopics, and sometimes each topic is different from the others). 3. Some documents hold primitive phrases (appearing only once in the content). 4. This index class uses a collection of strings: queries from the one-million-query track and titles from Wikipedia. 5. The system scanned the content of our Web collection looking for the list of strings above. 6. The captured strings from each document were ranked according to their impact in each document's content. 7. Topical phrases are validated and weighted based on their impact in each document's content and their idf among documents classified in the same topic. 14
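
A sketch of steps 5 and 6: scan a document for phrases drawn from the known string list (Million Query queries plus Wikipedia titles) and rank the captured phrases. The frequency-share score used here is a stand-in for the paper's impact measure, and the phrase list is invented.

    from collections import Counter

    known_phrases = {"barack obama", "query expansion", "term impact"}   # MQ queries + Wikipedia titles

    def captured_phrases(content):
        """Find known phrases in a document and rank them by how often they occur."""
        text = " ".join(content.lower().split())
        counts = Counter()
        for phrase in known_phrases:
            n = text.count(phrase)
            if n:
                counts[phrase] = n
        total = sum(counts.values()) or 1
        # the "impact" here is just the phrase's share of all captured occurrences
        return sorted(((p, c / total) for p, c in counts.items()), key=lambda x: -x[1])

    doc = "Query expansion with term impact: term impact improves ranking."
    print(captured_phrases(doc))   # [('term impact', 0.66...), ('query expansion', 0.33...)]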

15-16 (figure slides; no transcript text)

1. QE (query expansion) is important to make the results more diverse. 2. QE is necessary if the first result list is short. 3. QE is used only with the diversity topics/queries. 4. The terms used for expanding the original query terms are extracted from Wikipedia articles (connectivity). 5. QE works best if the query matches a Wikipedia topic literally and the article is long. Using shared links. Using the title-variation aspect, e.g.: 1. Lipomatosis, 2. Fatty Tumor, 3. Lipomatous Neoplasm. 17
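
A toy sketch of the two expansion signals above: title variations (redirects) and topics that share several links with the matched article. The small redirect and link tables are hand-made for illustration only, not real Wikipedia data.

    # Hypothetical Wikipedia fragments for illustration only.
    redirects = {"lipomatosis": ["fatty tumor", "lipomatous neoplasm"]}
    links = {
        "lipomatosis": {"adipose tissue", "benign tumor", "liposarcoma"},
        "lipoma":      {"adipose tissue", "benign tumor", "surgery"},
    }

    def expand_query(query):
        """Expand a query with Wikipedia title variations and shared-link topics."""
        topic = query.lower()
        expansions = list(redirects.get(topic, []))
        topic_links = links.get(topic, set())
        for other, other_links in links.items():
            if other != topic and len(topic_links & other_links) >= 2:
                expansions.append(other)                 # topics sharing several links
        return expansions

    print(expand_query("Lipomatosis"))   # ['fatty tumor', 'lipomatous neoplasm', 'lipoma']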

1. Home pages (".com", ".gov", ".org", ".edu", ".net", etc.). 2. Wikipedia results whose titles match the query literally. 3. Site preferences ("about.com", "answers.com", etc.). 4. Top ten results that ranked high, regardless of the type of site. 5. Other Wikipedia results that ranked high based on their contents. 6. Other results. Example: query "phoenix": home pages (for the adhoc task), user preferences (for the diversity task), other pages (for the adhoc and diversity tasks). 18
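
The final list concatenates these result classes in the priority order above. A minimal sketch of that merge, assuming each class is already a ranked list of document identifiers; the example documents for the "phoenix" query are placeholders.

    def merge_result_classes(classes, k=10):
        """Concatenate ranked result classes in priority order, dropping duplicates."""
        seen, merged = set(), []
        for ranked_docs in classes:            # classes given from highest to lowest priority
            for doc in ranked_docs:
                if doc not in seen:
                    seen.add(doc)
                    merged.append(doc)
                if len(merged) == k:
                    return merged
        return merged

    home_pages  = ["phoenix.gov", "visitphoenix.com"]
    wiki_exact  = ["en.wikipedia.org/wiki/Phoenix,_Arizona"]
    preferences = ["about.com/phoenix"]
    print(merge_result_classes([home_pages, wiki_exact, preferences], k=3))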

The relevance judgment file is built by professional assessors and includes the relevant results for each query. The best results were selected from the best data sets (A and B). If results are available in subset B but not in the relevance judgments, it means the corresponding results in set A are more relevant. The relevance degree of each result is based on the users' point of view. 19
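
Evaluation uses TREC-style relevance judgments. As a sketch, the snippet below reads a standard qrels file (topic, iteration, docno, relevance) and computes precision at 10 for one topic; the file name and the run variable are placeholders.

    from collections import defaultdict

    def load_qrels(path):
        """qrels lines look like: '<topic> <iteration> <docno> <relevance>'."""
        relevant = defaultdict(set)
        with open(path) as f:
            for line in f:
                topic, _, docno, rel = line.split()
                if int(rel) > 0:
                    relevant[topic].add(docno)
        return relevant

    def precision_at_k(ranked_docnos, relevant_docnos, k=10):
        top = ranked_docnos[:k]
        return sum(1 for d in top if d in relevant_docnos) / k

    # relevant = load_qrels("qrels.web.txt")                  # placeholder file name
    # print(precision_at_k(run["51"], relevant["51"], k=10))  # run: topic -> ranked docnos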

20-24 (figure slides; no transcript text)

The index classes work cooperatively. Eliminating one class from the index does not necessarily affect the final precision, because the same results may be retrieved from other classes. Eliminating one class from the index may increase the overall precision for a set of queries but not for a specific query (that is why we used all classes). Wikipedia has more impact than the other classes. The impact of each class depends on the type of query. 25

26-27 (figure slides; no transcript text)

A fast indexing and retrieval method. An efficient method for all types of queries. A centralized index (one-server system). Wikipedia is a suitable source for home-page finding, Web indexing, and query expansion. Each query must pass through all index classes during the search; then the type of query must be determined. The ordering (distribution) of documents in the final list is related not only to document weights but also to the type of query (navigational, informational, transactional). Dataset subset B (50 million pages) is enough for training and testing a Web search engine for retrieving the relevant documents. 28

Testing our system using more queries. Displaying the results in an efficient way, since our system is centralized. Using other resources besides Wikipedia and Alexa. Indexing real-time data from social resources such as Twitter and Facebook. Using a GUI for displaying our results instead of plain, simple text. 29

30 Questions? Email us or see the demo.