TERM IMPACT-BASED WEB PAGE RANKING. School of Electrical Engineering and Computer Science. Falah Al-akashi and Diana Inkpen, 2014.


2 Outline: Background, Challenges, Goal, Contributions, Our Approach, Query Processing, Evaluation, Conclusion.

3 TREC 2009-2012 Web track: one billion pages in 10 languages; 25 TB uncompressed, 5 TB compressed; 500,000,000 pages in subset A and 50,000,000 pages in subset B. Submitted systems came from Microsoft Research, the Yahoo group, the Google team, the University of Glasgow, the University of Waterloo, the University of Ottawa, the University of Delaware, the University of California, the University of Maryland, the University of Twente, Carnegie Mellon University, the University of Melbourne, the University of Amsterdam, York University, the University of Otago, the University of Massachusetts, the Queensland University group, the Chinese Academic group, the Hungarian Academic group, Centrum Wiskunde, the University of Dublin, the University of London, the SIFT project, etc.

4 The huge growth of the Internet from 1995 until now (millions of pages). Lack of clear topic boundaries in most Web documents, and in most user queries. Many relevant topics appear as subtopics, or as semantically similar topics, within the same documents. Search results cannot satisfy all users' points of view. Spam documents affect Web search engines. Home-page and entity-finding queries require extra effort and different algorithms than regular search algorithms.

5 Classic IR differs from Web IR because the Web environment is dynamic and highly diverse: information is often added, updated, or becomes unavailable. The Web keeps growing and becoming more complex; similarly, queries become more complex too. Some sites have no credibility in their content. A few popular sites provide connectivity and engagement between popular sites in a social manner, e.g., Wikipedia. Wikipedia seeks to create a summary of all human knowledge in the form of an online encyclopedia, and intends only to convey knowledge that is already established, recognized, and rarely changed. Content in Wikipedia is subject to the copyright laws of the United States. Wikipedia is the sixth-most-popular website worldwide according to Alexa Internet, receiving more than 2.7 billion U.S. page views every month.

6 Goals: improving retrieval effectiveness on Web data; exploiting the query structure; adapting an index structure capable of retrieving results for different types of queries. Contributions: we proposed a novel, centralized index structure that exploits the human knowledge accumulated and integrated in Wikipedia for indexing Web content; we demonstrated the importance of term impact for document weighting over other document measures (e.g., tf, tf-idf); we proposed alternative ways of query normalization and expansion using Wikipedia.

7 We proposed a collection of phrasal indexing algorithms suitable for queries of any length and any type. We showed the correlation between topics available in different Wikipedia articles. We proposed a novel search model that adapts a global server locally on one computer, and that is able to index and answer queries fast.

8 Our Index Structure

9 Using home pages (from Wikipedia external links). Using other relevant pages (from Wikipedia external references). Using the connectivity between documents for query expansion. Finding related topics for queries that are difficult to index, e.g., "to be or not to be, that is the question".

10 10% of the English repository (subset B), about 5 million documents. About 50% of the documents share the same content but are titled differently; about 50% are article-type documents, while the others are short definitions. Our indexing removed the short articles (using a threshold) and grouped similar and long articles by: 1. using CRC16 checksums; 2. using common tags; 3. using term impact (for retrieving initial results). Titles, terms, external links, and other related texts, such as anchors, are indexed in the Wikipedia index class.
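The de-duplication step on this slide (grouping documents that share the same content under different titles, via a CRC16 checksum) can be sketched as follows. The normalization step and the specific CRC-16 variant (Python's built-in `binascii.crc_hqx`, i.e., CRC-16/CCITT) are assumptions; the slide does not specify them.

```python
import binascii
from collections import defaultdict

def crc16_key(text):
    # CRC-16 of whitespace- and case-normalized content; crc_hqx is
    # Python's built-in CRC-16/CCITT (the slide's exact variant is unknown).
    normalized = " ".join(text.split()).lower()
    return binascii.crc_hqx(normalized.encode("utf-8"), 0xFFFF)

def group_duplicates(docs):
    """Group documents whose normalized content hashes to the same value."""
    groups = defaultdict(list)
    for doc_id, text in docs:
        groups[crc16_key(text)].append(doc_id)
    return groups

docs = [
    ("d1", "Diana Inkpen   Home Page"),
    ("d2", "diana inkpen home page"),      # same content, different casing
    ("d3", "Term impact based ranking"),
]
groups = group_duplicates(docs)
# d1 and d2 collapse into one group; d3 stays alone
```

A real pipeline would hash the page body rather than a short string, and fall back to the slide's other signals (common tags, term impact) when checksums disagree on near-duplicates.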

11 1. Using the domain name: indexing all terms available in the main domains.
- Single word, e.g., "diana": main domains www.diana.com, www.diana.gov, www.diana.org, www.diana.edu, www.diana.net, www.diana.(country code).
- Subdomains: diana.???.com, diana.???.gov, diana.???.org, diana.???.edu, diana.???.(country code).
- Two words or more, e.g., "princess diana": www.princessdiana.??, www.princess-diana.??, princess.diana.??, diana.princess.??.
All terms in the titles that refer to the domains above have been indexed.
2. Using Wikipedia external links, matched with patterns such as:
@"((http://.*?)\[\]home page)"
@"((.*?)\[\]www.*?)"
@"((http://.*?)\[\].*?official.*?web.*?site)"
@"((http://.*?)\[\].*?official.*?site)"
@"((http://.*?)\[\].*?" + query terms + @".*?\sweb.*?site)"
@"((.*?)\[\].*?website)"
@"((http://www.*?)\[\]" + query terms + @".*?\sweb.*?site)"
@"((http://.*?)\[\]link)"
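The slide's patterns are C#-style regexes over Wikipedia's external-link markup. A simplified Python analogue of the idea, matching a URL followed by cue text such as "official web site" or "home page" (the exact patterns and input format here are illustrative, not the authors'):

```python
import re

# Simplified analogues of the slide's patterns: capture a URL followed
# by cue text that marks it as an official/home-page link.
PATTERNS = [
    re.compile(r"(https?://\S+)\s+.*?official\s+web\s*site", re.I),
    re.compile(r"(https?://\S+)\s+.*?official\s+site", re.I),
    re.compile(r"(https?://\S+)\s+.*?home\s*page", re.I),
]

def extract_official_links(lines):
    """Return URLs from external-link lines that match any cue pattern."""
    hits = []
    for line in lines:
        for pat in PATTERNS:
            m = pat.search(line)
            if m:
                hits.append(m.group(1))
                break   # one hit per line is enough
    return hits

lines = [
    "http://www.uottawa.ca University of Ottawa official web site",
    "http://example.org some unrelated external link text",
    "http://site.uottawa.ca personal home page",
]
official = extract_official_links(lines)
```

The real patterns also splice the query terms into the regex at match time (the `+ query terms +` fragments above), which a Python version would do with string formatting before compiling.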

12 1. Abbreviations or combined terms used in the URLs can be recognized from the keywords in the titles. 2. Segmenting document titles into phrases at connector words ("or", "and", "at", "in", "on", "by", "with", "from", "for") or at punctuation characters (":", "|", "(", ")", "-", ",", "&"). 3. Measuring the impact of each phrase in its document's content. 4. Phrases with a high impact score were used for building and naming the index nodes; the others were discarded (threshold). 5. The impact of each phrase is computed as the cosine similarity between two vectors: the first vector is the extracted phrase; the second vector is the document content.
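Steps 2 and 5 above can be sketched directly: split titles at the listed connector words and punctuation, then score each candidate phrase by cosine similarity against the document's term vector. The tokenization details are assumptions; the connector and punctuation lists are the slide's own.

```python
import math
import re
from collections import Counter

# Connector words and punctuation from the slide.
SPLIT_WORDS = {"or", "and", "at", "in", "on", "by", "with", "from", "for"}

def segment_title(title):
    """Split a title into candidate phrases at connectors and punctuation."""
    phrases = []
    for chunk in re.split(r"[:|()\-,&]", title):
        phrase = []
        for word in chunk.lower().split():
            if word in SPLIT_WORDS:
                if phrase:
                    phrases.append(" ".join(phrase))
                phrase = []
            else:
                phrase.append(word)
        if phrase:
            phrases.append(" ".join(phrase))
    return phrases

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def phrase_impact(phrase, content):
    """Slide's step 5: cosine between the phrase vector and content vector."""
    return cosine(Counter(phrase.split()), Counter(content.lower().split()))
```

Phrases whose impact falls below a threshold would then be discarded (step 4); the threshold value is not given on the slide.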

13 1. Terms in the URLs and titles refer to the document's keywords. 2. At least one term from each user query is shared with the keywords above, while the other terms are available in the document's content. Index node format (keyword, followed by its impact score and term-frequency pairs):
site, [Impact, t1,f1; t2,f2; ...; tn,fn]
uottawa, [Impact, t1,f1; t2,f2; ...; tn,fn]
diana, [Impact, t1,f1; t2,f2; ...; tn,fn]
inkpen, [Impact, t1,f1; t2,f2; ...; tn,fn]
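The node layout and the matching rule on this slide can be modeled as a dictionary from keyword to an impact score plus term-frequency postings. All scores and terms below are made up for illustration; only the structure follows the slide.

```python
# Hypothetical index nodes: keyword -> (impact, [(term, frequency), ...]),
# mirroring the slide's "keyword, [Impact, t1,f1; t2,f2; ...]" layout.
index = {
    "uottawa": (0.82, [("ottawa", 14), ("university", 9), ("engineering", 4)]),
    "inkpen":  (0.77, [("diana", 11), ("nlp", 6)]),
}

def candidates(query_terms, index):
    """Slide's rule: at least one query term must match the node's keyword,
    and the remaining terms must appear among the node's content terms."""
    results = []
    for key, (impact, postings) in index.items():
        content_terms = {t for t, _ in postings}
        if key in query_terms and all(t == key or t in content_terms
                                      for t in query_terms):
            results.append((key, impact))
    return sorted(results, key=lambda kv: kv[1], reverse=True)
```

Usage: `candidates(["uottawa", "engineering"], index)` returns the `uottawa` node because one term matches the keyword and the other appears in its postings.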

14 1. Not all documents hold important keywords in the URLs and/or titles. 2. Some documents hold keywords only in the content (as subtopics, and sometimes each topic differs from the others). 3. Some documents hold primitive phrases (occurring only once in the content). 4. This index class uses a collection of strings: queries from the one-million-query track and titles from Wikipedia. 5. The system scanned the content of our Web collection looking for the strings in this list. 6. The captured strings from each document were ranked according to their impact in that document's content. 7. Topical phrases are validated and weighted based on their impact in each document's content and their idf among documents classified under the same topic.
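Steps 5 and 6 can be sketched as a scan of each document against the string catalogue, ranking hits by an impact proxy. The proxy used here (occurrence count times phrase length, normalized by document length) is an assumption for illustration; the slide does not give the exact impact formula for this class.

```python
def rank_known_phrases(content, known_phrases):
    """Scan a document for catalogue strings (query-track queries and
    Wikipedia titles, per the slide) and rank hits by an impact proxy.

    Proxy = occurrences * phrase length / document length (an assumption;
    the authors' actual impact measure is not specified here).
    """
    text = content.lower()
    doc_len = max(len(text.split()), 1)
    hits = []
    for phrase in known_phrases:
        count = text.count(phrase.lower())
        if count:
            hits.append((phrase, count * len(phrase.split()) / doc_len))
    return sorted(hits, key=lambda h: h[1], reverse=True)
```

Step 7 would then reweight the surviving phrases by idf computed over documents in the same topic class.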

15 (figure-only slide; no transcript text)

16 (figure-only slide; no transcript text)

17 1. Query expansion (QE) is important to make results more diverse. 2. QE is necessary if the first result list is short. 3. QE is used only with diversity topics/queries. 4. The terms used for expanding the original query are extracted from Wikipedia articles (connectivity). 5. QE works best if the query matches a Wikipedia topic literally and the article is long.  Using shared links.  Using title-variation aspects. Example: 1. Lipomatosis 2. Fatty Tumor 3. Lipomatous Neoplasm
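The shared-links idea on this slide can be sketched with a small link graph: expansion candidates are article titles that share outgoing links with the matched article. The graph below is a toy illustration built around the slide's Lipomatosis example, not real Wikipedia data.

```python
# Hypothetical link graph: article title -> set of linked article titles.
link_graph = {
    "Lipomatosis": {"Adipose tissue", "Neoplasm", "Lipoma"},
    "Lipoma":      {"Adipose tissue", "Neoplasm"},
    "Fatty tumor": {"Adipose tissue"},
    "Phoenix":     {"Arizona"},
}

def expand_query(topic, link_graph, min_shared=1):
    """Suggest related article titles that share at least `min_shared`
    outgoing links with the matched article (the slide's shared-links idea)."""
    base = link_graph.get(topic, set())
    return sorted(title for title, links in link_graph.items()
                  if title != topic and len(base & links) >= min_shared)
```

Title-variation expansion (the slide's second bullet) would additionally match redirect and alternative titles of the same article, which this sketch omits.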

18 1. Home pages (".com", ".gov", ".org", ".edu", ".net", etc.). 2. Wikipedia results whose titles match the query literally. 3. Site preferences ("about.com", "answers.com", etc.). 4. Top ten results that ranked high, regardless of the type of site. 5. Other Wikipedia results that ranked high based on their content. 6. Other results.
Example:
http://www.phoenix.edu/
http://phoenix.edu/
http://axia.phoenix.edu/
http://en.wikipedia.org/wiki/en:University_of_Phoenix
http://en.wikipedia.org/wiki/University_of_Phoenix
http://phoenix.about.com/library/blseatingcardinals.htm
http://wiki.answers.com/q/what_are_some_......_the_university_of_phoenix
http://business.phoenix.edu/
http://technology.phoenix.edu/
http://military.phoenix.edu/
http://artsandsciences.phoenix.edu/
http://education.phoenix.edu/
Home pages (for the adhoc task); user preferences (for the diversity task); other pages (for both the adhoc and diversity tasks).
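The tiered ordering above can be expressed as a sort key that maps each URL to its preference class. This sketch collapses tiers 4-6 into one catch-all class and uses simple URL heuristics; the authors' actual classifier is not specified on the slide.

```python
from urllib.parse import urlparse

PREFERRED_SITES = ("about.com", "answers.com")            # tier 3 sites
HOME_TLDS = (".com", ".gov", ".org", ".edu", ".net")      # tier 1 suffixes

def result_class(url, query_slug):
    """Lower class number = shown earlier (simplified six-tier ordering)."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    path = parsed.path.strip("/").lower()
    if not path and host.endswith(HOME_TLDS):
        return 0            # 1. home pages (domain root)
    if "wikipedia.org" in host and path.endswith(query_slug):
        return 1            # 2. Wikipedia titles matching the query
    if any(host.endswith(s) for s in PREFERRED_SITES):
        return 2            # 3. preferred sites
    return 3                # 4-6. everything else, ordered by content score

results = [
    "http://en.wikipedia.org/wiki/University_of_Phoenix",
    "http://phoenix.about.com/library/blseatingcardinals.htm",
    "http://www.phoenix.edu/",
    "http://business.phoenix.edu/page",
]
ordered = sorted(results, key=lambda u: result_class(u, "university_of_phoenix"))
```

Within each class, ties would be broken by the document's impact-based score from the earlier slides.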

19 The relevance judgment file is built by professional assessors and includes the relevant results for each query. The best results were selected from the best data sets (A and B). If results are available in subset B but not in the relevance judgments, the corresponding results in set A are more relevant. The relevancy degree of each result is based on the users' point of view.

20-24 (figure-only slides; no transcript text)

25 The index classes work cooperatively. Eliminating one class from the index does not necessarily affect the final precision, because the same results may be retrieved from other classes. Eliminating one class may increase the overall precision for a set of queries, but not for a specific query (that is why we used all classes). Wikipedia has more impact than the other classes. The impact of each class depends on the type of query.

26-27 (figure-only slides; no transcript text)

28  Fast indexing and retrieval method.  Efficient method for all types of queries.  Centralized index (one-server system).  Wikipedia is a typical content source for home-page finding, Web indexing, and query expansion.  Each query must pass through all index classes during the search; then the type of query is determined.  The ordering (distribution) of documents in the final list is related not only to document weightings, but also to the type of query (navigational, informational, transactional).  Dataset subset B (50 million pages) is enough for training and testing a Web search engine for retrieving the relevant documents.

29 Future work: testing our system with more queries; displaying the results in an efficient way, since our system is centralized; using resources other than only Wikipedia and Alexa; indexing real-time data from social resources such as Twitter and Facebook; using a GUI to display our results instead of plain, simple text.

30 Questions? Email falak081@uottawa.ca or diana.inkpen@uottawa.ca. Demo: http://site.uottawa.ca/~falak081

