Introduction: Searching on the World Wide Web. Common search tools include Google and Yahoo. Traditional approach: keyword-query based. You need to specify your information need by giving relevant keywords, which is prone to errors! Question: what do I do if I don't know exactly what I am looking for?
Introduction: Another way… –Use a URL as the search input instead of a phrase of text, e.g. www.nytimes.com. What are the requirements? –Fast –High precision –Little input data
Introduction: How does it work? –Web graph structure –Two algorithms are proposed: Companion: derived from the HITS (Hyperlink-Induced Topic Search) algorithm proposed by Kleinberg for ranking search queries; makes use of weights and of hub and authority scores. Co-citation: finds pages that are frequently co-cited with an input URL u. [Figure: sites A, B, C; sites X, Y, Z; input URL u; result: found X, Y, Z]
Companion Algorithm: Takes a starting URL u as input, e.g. www.awebsite.com. Made up of 4 steps: –Build the vicinity graph of u –Contract duplicates and near-duplicates in the graph –Compute edge weights based on host-to-host connections –Compute a hub score and an authority score for each node in the graph and return the top-ranked authority nodes
Companion Algorithm: Uses 5 values* to help determine relevant pages: Go Back (B): how many parent sites of the website to include, i.e. going from u1 to p1. Back-Forward (BF): how many child sites of each parent to include, i.e. going from u1 to p2 then to u2 (or u1). Forward (F): how many children of the site (pages it links to) to include, i.e. u1 to c1. Forward-Back (FB): how many parent sites of each child to include, i.e. u1 to c1 to u3. STOP list: websites considered not to be relevant to the page's content, e.g. http://validator.w3.org/check?uri=referer, www.microsoft.com/ie/dowload.html, www.yahoo.com. *These values are determined before the algorithm is executed. [Figure: a Web graph around a website, with nodes p1, p2, u1, u2, u3, c1, c2 connected by hyperlinks]
Companion Algorithm: Step 1 – Building the vicinity graph of u. If u itself is on the STOP list, the STOP list is ignored; otherwise all sites on the STOP list are left out of the graph. [Figure: vicinity graph after step 1, with nodes p1, p2, c1, c2, u2, u3]
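Step 1 can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function name `build_vicinity_graph`, the `parents`/`children` lookup tables, the small default limits, and the choice to keep the first nodes in list order are all assumptions made for the example.

```python
from typing import Dict, List, Set, Tuple

def build_vicinity_graph(
    u: str,
    parents: Dict[str, List[str]],    # page -> pages linking to it (hypothetical lookup)
    children: Dict[str, List[str]],   # page -> pages it links to (hypothetical lookup)
    stop_list: Set[str],
    B: int = 2, BF: int = 2, F: int = 2, FB: int = 2,   # arbitrary small limits
) -> Set[Tuple[str, str]]:
    """Return the directed edges (source, target) of a vicinity graph around u."""
    # If u itself is on the STOP list the list is disabled, otherwise it filters nodes.
    blocked = set() if u in stop_list else set(stop_list)

    def pick(nodes: List[str], limit: int) -> List[str]:
        # Assumption: keep the first `limit` nodes that are neither blocked nor u.
        return [n for n in nodes if n not in blocked and n != u][:limit]

    edges: Set[Tuple[str, str]] = set()
    for p in pick(parents.get(u, []), B):            # Go Back: parents of u
        edges.add((p, u))
        for s in pick(children.get(p, []), BF):      # Back-Forward: siblings of u
            edges.add((p, s))
    for c in pick(children.get(u, []), F):           # Forward: children of u
        edges.add((u, c))
        for q in pick(parents.get(c, []), FB):       # Forward-Back: other parents of c
            edges.add((q, c))
    return edges
```

Calling it with u1 as the start page and www.yahoo.com on the STOP list leaves the Yahoo link out of the graph, while starting from www.yahoo.com itself disables the filter.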
Step 2 – Eliminate any duplication –If a node (website) in the graph has 10 or more links and has 95% of its links in common with another node*, combine the links from both nodes (union) to create one node –This removes sites that are likely to be the same (e.g. mirror sites, or the same site under different names). Step 3 – Assign edge weights –If two nodes are on the same host, the weight of the edge between them is set to zero –If there are k links from pages on one host going to a single page on another host (i.e. many-to-one), each of those edges gets an authority weight of 1/k –If there are L links from a single page to pages on another host (i.e. one-to-many), each of those edges gets a hub weight of 1/L. The vicinity graph of u has now been constructed! *This clearly has its problems!!!
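The two rules above can be sketched as below. This is an illustration under stated assumptions: the names `host`, `near_duplicates` and `edge_weights` are hypothetical, URLs are assumed to carry a scheme such as http://, both nodes are required to have 10 or more links, and the 95% overlap is measured against the larger link set (the slide does not say which set it is measured against).

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple
from urllib.parse import urlsplit

Edge = Tuple[str, str]

def host(url: str) -> str:
    """Host part of a URL; assumes the URL has a scheme such as http://."""
    return urlsplit(url).netloc.lower()

def near_duplicates(links_a: Set[str], links_b: Set[str]) -> bool:
    """Step 2 test: 10 or more links each and at least 95% of links in common.
    Measuring the overlap against the larger set is an assumption."""
    if len(links_a) < 10 or len(links_b) < 10:
        return False
    common = len(links_a & links_b)
    return common >= 0.95 * max(len(links_a), len(links_b))

def edge_weights(edges: Iterable[Edge]) -> Dict[Edge, Tuple[float, float]]:
    """Step 3: map each edge (src, dst) to (authority_weight, hub_weight)."""
    edges = set(edges)
    # k: number of edges from pages on one host into one target page
    into = Counter((host(s), d) for s, d in edges)
    # L: number of edges from one source page to pages on one host
    out = Counter((s, host(d)) for s, d in edges)
    weights: Dict[Edge, Tuple[float, float]] = {}
    for s, d in edges:
        if host(s) == host(d):
            weights[(s, d)] = (0.0, 0.0)   # same host: the edge carries no weight
        else:
            weights[(s, d)] = (1.0 / into[(host(s), d)], 1.0 / out[(s, host(d))])
    return weights
```

The effect of the 1/k and 1/L weights is that many links between the same pair of hosts count, in total, about as much as a single link.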
Companion Algorithm: Step 4 – Compute hub and authority scores. Nodes (websites) with a high authority score are expected to have relevant content. Nodes with a high hub score are expected to contain links to relevant content. The 10 nodes with the highest authority scores are then returned as the pages related to the starting URL u.
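A HITS-style iteration with edge weights might look like the sketch below. The function names, the fixed iteration count, and the input format (a dict from edge to an (authority weight, hub weight) pair) are assumptions for illustration; the slides do not give the exact update rule or stopping condition.

```python
from math import sqrt
from typing import Dict, List, Tuple

Edge = Tuple[str, str]

def _normalise(scores: Dict[str, float]) -> Dict[str, float]:
    # Scale so the squares of the scores sum to 1 (all zeros stay zero).
    norm = sqrt(sum(v * v for v in scores.values()))
    return {n: (v / norm if norm else 0.0) for n, v in scores.items()}

def hub_authority_scores(
    weights: Dict[Edge, Tuple[float, float]],   # (src, dst) -> (authority_w, hub_w)
    iterations: int = 50,                       # assumed fixed number of rounds
) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Return (hub, authority) score dictionaries for every node in the graph."""
    nodes = {n for edge in weights for n in edge}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # A page's authority grows with the hub scores of the pages linking to it.
        new_auth = {n: 0.0 for n in nodes}
        for (s, d), (aw, _hw) in weights.items():
            new_auth[d] += hub[s] * aw
        # A page's hub score grows with the authority of the pages it links to.
        new_hub = {n: 0.0 for n in nodes}
        for (s, d), (_aw, hw) in weights.items():
            new_hub[s] += new_auth[d] * hw
        auth, hub = _normalise(new_auth), _normalise(new_hub)
    return hub, auth

def top_authorities(auth: Dict[str, float], k: int = 10) -> List[str]:
    """The k nodes with the highest authority score (ties broken by name)."""
    return sorted(auth, key=lambda n: (-auth[n], n))[:k]
```

With unweighted edges (all weights 1.0) this reduces to the plain hub/authority iteration; the per-edge weights are what stop one host from dominating the scores.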
Co-citation Algorithm: Two sites are co-cited if they have a common parent, e.g. u3 and u1 are co-cited by p1. The degree of co-citation (DoC) is the number of common parents two sites share, e.g. u3 and u1 have a DoC of 2. The algorithm finds the siblings of a site, computes their DoC and returns the 10 sites with the highest DoC. If fewer than 15 siblings of u have a DoC of at least 2 with u, the algorithm restarts with a URL one level up from the original, e.g. if u = a.com/X/Y/Z then the new u = a.com/X/Y. [Figure: siblings of u1, with nodes p1, p2, u1, u2, u3]
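The degree-of-co-citation ranking can be sketched as follows. The function names and the `parents`/`children` lookup tables are hypothetical, and `one_level_up` assumes URLs written without a scheme, as in the a.com/X/Y/Z example.

```python
from collections import Counter
from typing import Dict, List, Tuple

def cocitation_ranking(
    u: str,
    parents: Dict[str, List[str]],    # page -> pages linking to it (hypothetical lookup)
    children: Dict[str, List[str]],   # page -> pages it links to (hypothetical lookup)
    top_k: int = 10,
) -> List[Tuple[str, int]]:
    """Return up to top_k (sibling, DoC) pairs, highest degree of co-citation first."""
    doc: Counter = Counter()
    for p in set(parents.get(u, [])):
        # Every other page linked from a parent of u is a sibling of u.
        for sibling in set(children.get(p, [])):
            if sibling != u:
                doc[sibling] += 1     # one more parent shared with u
    return sorted(doc.items(), key=lambda item: (-item[1], item[0]))[:top_k]

def one_level_up(url: str) -> str:
    """Drop the last path element; assumes a scheme-less URL such as a.com/X/Y/Z."""
    head, sep, _tail = url.rstrip("/").rpartition("/")
    return head if sep else url
```

For the graph in the figure, where p1 links to u1 and u3 and p2 links to u1, u2 and u3, the ranking for u1 puts u3 (DoC 2) ahead of u2 (DoC 1).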
Netscape's Approach: the What's Related function. Not a lot of detail is given in the paper! It gets similar pages from web crawling, archiving, categorising and data mining (as opposed to just using the web graph like the previous algorithms). It also tries to learn from trends (i.e. what users click on after they have searched for a keyword).
Implementation: Compaq's Connectivity Server –Provides connectivity information for 180 million URLs (nodes). Multi-threaded server that takes in URLs –Uses either the Companion or the Co-citation algorithm to find related pages.
Evaluation: Studies were carried out to determine the performance of these algorithms, benchmarked against Netscape's approach. Revisit the initial requirements: –Speed –Precision –Little data input (already achieved)
Evaluation: Speed –109 ms for Companion and 195 ms for Co-citation. –The complexity of the Co-citation algorithm is O(n log n). Precision
Critique: Faults within HITS are not investigated; Nomura, Satoshi, and Hayamizu, Analysis and Improvement of HITS Algorithm for Detecting Web Communities, show some of the problems with the algorithm. Requires the user to have already found something relevant to what they are looking for, i.e. I have found NYTimes and want to see what alternatives are available. Can it handle the scale of the web today? It was tested with connectivity information for 180 million URLs, while the indexable web stands at over 11 billion pages. Links to friends' web pages that are not relevant to the input URL will be taken into account; considering the size of the web today, this may lead to bad results. A small, specialised population was used in the test, so the approach lacks generality. The 'two clicks away' idea is not the case today.
Critique: Looking at the positives. The algorithms do outperform Netscape's algorithm for finding related pages, and can be extended to handle more than one input URL*. Easy to implement. Many papers were consulted and used in the process of writing and implementing the work. *At the time (1999)
Applications and Future Work: Data mining (Web structure mining) –Finding authoritative Web pages. Classifying Web documents –Exploring co-cited material: if pages are linked, they could be related; if a page is pointed to, it could be important. Extend the algorithm to strengthen the heuristic and look beyond the 'two clicks away' idea. There is a lack of further work because the assumption is so unrealistic by today's standards.
Conclusion: The paper suggested a solution to the problem of searching for a topic that cannot easily be expressed as a simple text query. The Companion and Co-citation algorithms are fast ways of searching that differ from traditional text queries. The solution can easily be adapted and implemented in web servers.
Q & A Any Questions?
References: Hyperlink structure of the Web: G.O. Arocena, A.O. Mendelzon and G.A. Mihaila, Applications of a Web Query Language, in: Proc. of the Sixth International World Wide Web Conference. Chakrabarti et al., Enhanced Hypertext Categorisation Using Hyperlinks, in which links and their order are used to categorise Web pages. E. Spertus, ParaSite: Mining Structural Information on the Web, which also suggested using co-citation and other forms of connectivity to identify related Web pages. Authoritative Sources in a Hyperlinked Environment: the HITS algorithm is used as a starting point for the Companion algorithm, which extends and modifies it. Pável Calado, Marco Cristo, Marcos André Gonçalves, Edleno S. de Moura, Berthier Ribeiro-Neto, Nivio Ziviani, Linkage Similarity Measures for the Classification of Web Documents. Sanjay Kumar Madria, Web Mining – A Bird's Eye View (presentation).