MARG-DARSHAK: A Scrapbook on Web


1 MARG-DARSHAK: A Scrapbook on Web. Search engines allow users to enter keywords relating to a topic and retrieve information about Internet sites (URLs) containing those keywords. Although search engines (e.g. Excite, AltaVista) attempt to organize and rank the information pertaining to a user’s search, they fail to group the results and form relationships between them.

2 Our goal is to develop an advanced search engine that makes search results easy to browse by grouping, organizing and relating them. MARG-DARSHAK returns a graph of nodes and edges, where each node contains one or more URLs and each edge is labeled with the type of relationship between the nodes it connects.
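The result graph described on this slide can be pictured with a small data structure. This is only a sketch under assumptions: the names `Node`, `ResultGraph`, `add_node`, `add_edge` and `neighbors` are hypothetical, the example URLs are placeholders, and the slides do not say how MARG-DARSHAK actually stores its graph.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A group of one or more URLs (hypothetical structure, not from the slides)."""
    node_id: str
    urls: list[str]


@dataclass
class ResultGraph:
    """Nodes plus edges, each edge labeled with a relationship type."""
    nodes: dict[str, Node] = field(default_factory=dict)
    # Each edge is (source node id, target node id, relationship type).
    edges: list[tuple[str, str, str]] = field(default_factory=list)

    def add_node(self, node_id: str, urls: list[str]) -> None:
        self.nodes[node_id] = Node(node_id, list(urls))

    def add_edge(self, src: str, dst: str, relation: str) -> None:
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must be added before the edge")
        self.edges.append((src, dst, relation))

    def neighbors(self, node_id: str) -> list[tuple[str, str]]:
        """Return (target id, relationship type) for edges leaving node_id."""
        return [(dst, rel) for src, dst, rel in self.edges if src == node_id]


# Placeholder data: two groups of URLs joined by a Similar-To edge.
graph = ResultGraph()
graph.add_node("g1", ["http://example.org/a", "http://example.org/b"])
graph.add_node("g2", ["http://example.org/c"])
graph.add_edge("g1", "g2", "Similar-To")
```

Storing the relationship type on the edge keeps the node contents (URLs) separate from how the groups relate, which matches the slide's description of nodes and labeled edges.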

3 Issues: how to group the web pages; how to rank them; how to determine the relationships among web pages; how to ensure the reliability of data. Currently, we use a meta-search engine to retrieve the initial results.

4 [Architecture diagram; components: User Interface, Meta-search Engine, Grouping of Results, Ranking, Web Database, Relationships, Results Display, Web QL]

5 Reliability of Data in WWW: Information about a particular topic is available at many web sites. How do we know that the information provided on the web is reliable, up to date and accurate?

6 Evaluation Criteria: Accuracy, Coverage, Currency, Ownership, Objectivity, Authority. [Diagram label: Search Engine]

7 Methods to Ensure Reliability: ‘Last Update’ method; ‘Majority Basis’ method; ‘Polling’ method; ‘Query Driven’ method; ‘Home (Official) Site’ method.

8 User Interface: to accept user input; to display results; to refine user input.

9 Web Query Language: to select the data pertaining to the user’s query; to eliminate redundant data; to manipulate web data.

10 Grouping Documents: A search engine returns several web sites that contain the same information. These results need to be grouped based on keywords and a similarity measure. (Sub)keywords are provided by the user, or an ontology can be used.

11 Grouping is based on phrases common to many documents; we use the suffix tree clustering algorithm (O. Zamir and O. Etzioni, “Web Document Clustering”, SIGIR’98). This algorithm has three steps: (1) clean the documents; (2) identify document base(s); (3) merge these document base(s).

12 A suffix tree is created from the plain documents, where each document is treated as a string. Strings are decomposed into words. Each leaf node contains a list of all the documents that contain the concatenation of all the strings on the path from the root node to that leaf node.
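The idea of mapping shared phrases to the documents that contain them can be sketched as follows. This is a simplified illustration, not the algorithm from the cited paper: it builds an uncompressed word-level suffix trie rather than a compact suffix tree, records document ids at every node rather than only at leaves, and caps phrase length with a hypothetical `max_len` parameter. The names `TrieNode`, `build_suffix_trie` and `docs_with_phrase` are assumptions made for this example.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # word -> TrieNode
        self.docs = set()   # ids of documents containing the phrase ending here


def build_suffix_trie(documents, max_len=3):
    """Insert every word-level suffix of each document, truncated to max_len words."""
    root = TrieNode()
    for doc_id, text in enumerate(documents):
        words = text.lower().split()
        for start in range(len(words)):
            node = root
            for word in words[start:start + max_len]:
                node = node.children.setdefault(word, TrieNode())
                node.docs.add(doc_id)
    return root


def docs_with_phrase(root, phrase):
    """Return the ids of documents that contain the given phrase."""
    node = root
    for word in phrase.lower().split():
        node = node.children.get(word)
        if node is None:
            return set()
    return set(node.docs)


# Toy documents, treated as strings of words.
documents = ["cat ate cheese", "mouse ate cheese too", "cat ate mouse too"]
trie = build_suffix_trie(documents)
shared = docs_with_phrase(trie, "ate cheese")  # documents 0 and 1 share this phrase
```

Each node whose document set has more than one member corresponds to a phrase shared by several documents, which is the raw material for the document bases.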

13 Merge these document base(s) (leaf nodes) into larger groups based on the ratio of their intersecting documents to the total number of documents that they contain. Two document bases B1 and B2 are merged if they satisfy the following two conditions: (1) |B1 ∩ B2|/|B1| > m and (2) |B1 ∩ B2|/|B2| > m, where 0 ≤ m ≤ 1 is the merging threshold value.
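The two merge conditions can be checked directly on sets of document ids. The sketch below is hedged: the slide does not say how chains of merges are handled, so the code assumes merging is transitive (if A merges with B and B with C, all three end up in one group) and uses a small union-find for that. The default `m=0.5` and the names `should_merge` and `merge_bases` are hypothetical.

```python
def should_merge(b1, b2, m):
    """True if |b1 ∩ b2|/|b1| > m and |b1 ∩ b2|/|b2| > m."""
    if not b1 or not b2:
        return False
    common = len(b1 & b2)
    return common / len(b1) > m and common / len(b2) > m


def merge_bases(bases, m=0.5):
    """Group document bases (sets of document ids) that pass the merge test.

    Assumption: merging is transitive, implemented with union-find.
    """
    parent = list(range(len(bases)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(bases)):
        for j in range(i + 1, len(bases)):
            if should_merge(bases[i], bases[j], m):
                parent[find(j)] = find(i)

    groups = {}
    for i, base in enumerate(bases):
        groups.setdefault(find(i), set()).update(base)
    return list(groups.values())


# Toy document bases: the first two overlap in 2 of 3 documents each.
bases = [{1, 2, 3}, {2, 3, 4}, {7, 8}]
merged = merge_bases(bases, m=0.5)
```

With these toy bases, the first two satisfy both conditions (2/3 > 0.5 in each direction) and are combined, while the third shares no documents and stays separate.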

14 The shared phrases of a group provide an informative way of summarizing its contents to the user. To identify redundant phrases, we adopt the ‘coverage’ method; that is, the domain of the user’s topic of interest.

15 Ordering Document Groups: Groups are ordered by their relevance to the query; prior knowledge of the domain of the search problem is a must. Ordering can be done within a group and among groups. Documents can be indexed based on the occurrence of keywords.
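One simple reading of keyword-based ordering, within a group and among groups, is sketched below. The scoring rule (raw keyword counts, with groups ranked by their average document score) is an assumption made for illustration; the slides do not specify the ranking function, and `keyword_score` and `order_groups` are hypothetical names.

```python
def keyword_score(text, keywords):
    """Count how often the query keywords occur in a document (assumed relevance measure)."""
    words = text.lower().split()
    return sum(words.count(k.lower()) for k in keywords)


def order_groups(groups, keywords):
    """Sort documents within each group, then sort the groups themselves.

    groups: list of lists of document texts.
    Groups are ranked by average document score (an assumption).
    """
    ranked = []
    for group in groups:
        docs = sorted(group, key=lambda d: keyword_score(d, keywords), reverse=True)
        avg = sum(keyword_score(d, keywords) for d in docs) / len(docs) if docs else 0.0
        ranked.append((avg, docs))
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return [docs for _, docs in ranked]


# Toy groups of document texts.
groups = [
    ["apple pie recipe", "banana bread"],
    ["search engine ranking", "web search engine search"],
]
ordered = order_groups(groups, ["search", "engine"])
```

A domain-aware ranking, as the slide calls for, would replace `keyword_score` with a measure that uses prior knowledge of the search domain.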

16 Defining Relationships between Documents: Information about a topic is scattered across the web, and relationships exist among web pages based on their contents. Relationships can be defined on static web pages. Examples: Similar-To, Example-Of, Next-To, Previous-To, Derived-From, Same-As, Part-Of.

17 Similar-To: Two documents are similar to each other if they have the same semantic meaning. Document analysis work can be adopted to find this relationship, based on common or similar words, occurrences of these words in the same order, and a similarity measure.
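The two surface signals named on this slide, shared words and shared word order, can be approximated with standard-library tools. This is a rough sketch, not a semantic comparison: it uses Jaccard overlap of word sets and `difflib.SequenceMatcher` on word sequences as stand-ins for the unspecified similarity measure, and the `threshold=0.5` default and function names are hypothetical.

```python
from difflib import SequenceMatcher


def word_overlap(a, b):
    """Jaccard overlap of the two documents' word sets (ignores order)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


def ordered_overlap(a, b):
    """Similarity of the word sequences, sensitive to word order."""
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()


def similar_to(a, b, threshold=0.5):
    """Assumed rule: both measures must reach the threshold."""
    return word_overlap(a, b) >= threshold and ordered_overlap(a, b) >= threshold


same_order = similar_to("web search engine results", "web search engine ranking")
reversed_order = similar_to("engine search web", "web search engine")
```

The second pair shares every word but in reverse order, so it passes the word-set test and fails the order-sensitive one, which illustrates why the slide lists both signals.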

18 Example-Of: Find the presence of ‘e.g.’, ‘example of’, ‘explains’, etc. before or after certain keywords. A web page u is an example of web page v if there are at least m references from v to u.
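The reference-count rule (at least m references from v to u) can be sketched with the standard-library HTML parser. The sketch makes simplifying assumptions: a "reference" is taken to be an `<a href>` whose value equals u's URL exactly (no URL normalization), and `m=2` is an arbitrary default. `LinkCollector`, `count_references` and `is_example_of` are hypothetical names.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def count_references(html_of_v, url_of_u):
    """Count links in page v that point at u (exact string match, an assumption)."""
    parser = LinkCollector()
    parser.feed(html_of_v)
    return sum(1 for href in parser.hrefs if href == url_of_u)


def is_example_of(html_of_v, url_of_u, m=2):
    """u is treated as an example of v if v references u at least m times."""
    return count_references(html_of_v, url_of_u) >= m


# Placeholder page v with two links to u and one link elsewhere.
page_v = (
    '<p>See <a href="http://example.org/u">one</a> and '
    '<a href="http://example.org/u">two</a>, also '
    '<a href="http://example.org/w">w</a>.</p>'
)
result = is_example_of(page_v, "http://example.org/u", m=2)
```

A fuller version would also look for the cue phrases listed on the slide near the keywords, and would normalize URLs before comparing them.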

19 Next-To, Previous-To. Next-To: search for words like ‘refer to’, ‘further reading’, ‘more information’, etc. Previous-To: such a relationship is hard to determine because it is difficult to know which documents link to a given web page.
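The Next-To heuristic can be sketched as a scan for links that appear next to one of the cue phrases. Everything beyond the three cue phrases is an assumption: the rule that the cue must appear in the anchor text or in the text since the previous link, the 60-character `window`, and the names `CueLinkFinder` and `next_to_links` are all hypothetical.

```python
from html.parser import HTMLParser

NEXT_TO_CUES = ("refer to", "further reading", "more information")


class CueLinkFinder(HTMLParser):
    """Collect hrefs of links whose anchor text or preceding text contains a cue phrase."""

    def __init__(self, cues=NEXT_TO_CUES, window=60):
        super().__init__()
        self.cues = cues
        self.window = window      # how many preceding characters to keep
        self.recent_text = ""     # text seen since the previous link
        self.current_href = None  # href of the link being read, if any
        self.anchor_text = ""
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.current_href = dict(attrs).get("href")
            self.anchor_text = ""

    def handle_data(self, data):
        if self.current_href is not None:
            self.anchor_text += data
        else:
            self.recent_text = (self.recent_text + data)[-self.window:]

    def handle_endtag(self, tag):
        if tag == "a" and self.current_href is not None:
            context = (self.recent_text + " " + self.anchor_text).lower()
            if any(cue in context for cue in self.cues):
                self.matches.append(self.current_href)
            self.current_href = None
            self.recent_text = ""  # a cue applies only to the first link after it


def next_to_links(html):
    finder = CueLinkFinder()
    finder.feed(html)
    return finder.matches


# Placeholder page with one cue-marked link and one ordinary link.
page = (
    '<p>Intro. For more information see '
    '<a href="http://example.org/next">this page</a>. '
    'Home: <a href="http://example.org/home">home</a></p>'
)
candidates = next_to_links(page)
```

This only covers Next-To, which can be read off a page's own outgoing links. Previous-To would need knowledge of incoming links, which, as the slide notes, is hard to obtain from the page alone.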

