1 WEDAGEN: A Synthetic Web Database Generator

2 Presentation Outline
- Existing WWW search mechanisms
- WHOWEDA: A Warehouse of Web Data
- Modular structure of WEDAGEN
- Configuration parameters
- Performance evaluation
- Summary and future work

5 Existing W3 Search Mechanisms
- Time delay in manual navigation of the web
- Overwhelming results and unwanted information
- No tool for organizing and storing harnessed information for further manipulation
- Search engines and browsers are not always the best ways to systematically harness information from the web
- The WHOWEDA approach @ NTU

6 Overview of WHOWEDA
- A web warehousing system to store and manipulate web information
- Stores extracted information as ‘web tables’ and provides ‘web operators’ to manipulate them
- To extract information from W3, the user defines a ‘query graph’
- The result of extraction is a set of web tuples; each tuple instantiates the query graph
- More information: http://www.cais.ntu.edu.sg:8000/~whoweda

7 Example: Query graph (web schema)
[Figure: query graph over nodes N1, N2, N3, N4, N5]
Predicates:
- N1.URL EQUALS “http://sunsite.doc.ic.ac.uk/bySubject/Computing/UniSciDepts.html”
- L2.LABEL EQUALS “faculty”
- L3.LABEL EQUALS “research projects”
- L4.LABEL CONTAINS “publications”
- L5.LABEL CONTAINS “publications”
- N5.TEXT CONTAINS “Internet computing”
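
The predicates on this slide can be pictured as plain data. The following is a minimal Python sketch, not WHOWEDA's actual representation: the `Predicate` class, its field names, and the `matches` method are all assumptions made for illustration. The edges between N1 and N5 are not recoverable from the transcript, so only the predicates are modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Predicate:
    """One condition on a node (N*) or link (L*) variable. Hypothetical type."""
    target: str     # variable name, e.g. "N1" or "L2"
    attribute: str  # e.g. "URL", "LABEL", "TEXT"
    operator: str   # "EQUALS" or "CONTAINS"
    value: str

    def matches(self, actual: str) -> bool:
        # EQUALS is exact string equality; CONTAINS is substring search.
        if self.operator == "EQUALS":
            return actual == self.value
        if self.operator == "CONTAINS":
            return self.value in actual
        raise ValueError(f"unknown operator: {self.operator}")


# The predicates listed on the slide, written as data.
QUERY_PREDICATES = [
    Predicate("N1", "URL", "EQUALS",
              "http://sunsite.doc.ic.ac.uk/bySubject/Computing/UniSciDepts.html"),
    Predicate("L2", "LABEL", "EQUALS", "faculty"),
    Predicate("L3", "LABEL", "EQUALS", "research projects"),
    Predicate("L4", "LABEL", "CONTAINS", "publications"),
    Predicate("L5", "LABEL", "CONTAINS", "publications"),
    Predicate("N5", "TEXT", "CONTAINS", "Internet computing"),
]
```

A web tuple would then instantiate the query graph when every predicate matches the corresponding node or link attribute.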

8 Example: Query results
Id  Name   Age
A1  John   23
C2  Wendy  35
B4  Jane   25
A2  Wendy  35
C9  Pete   42
B3  Kim    38
F8  Tom    22
G7  Cindy  47

10 Objectives
- Need to perform systematic evaluation of web operators during WHOWEDA development
- Limitations of testing using real web data
- To design a testbed that is controllable, comprehensive and systematic for evaluating web database systems
- To control the quantity and quality of synthetic web tuples by allowing users to specify configuration parameters and web schemas
- WEDAGEN: A Web Database Generator

11 System Architecture of WEDAGEN

12 Configuration Input Parameters
[Diagram: taxonomy of WEDAGEN parameters. Labels are listed in extraction order; the exact grouping is not recoverable from the transcript.]
Categories: Default, Specific, Selectivity, Instance Related, Control
First parameter list: NumTuples, NumSourceNodeInstances, FanOut, NumKeyWordsPerNodeInstance, NumWordsPerNodeInstance, NumWordsPerLinkLabel, NumWordsPerHostName, NumWordsPerTitle, LocalGlobalLink
Second parameter list: NumSourceNodeInstances, FanOut, NumKeyWordsPerNodeInstance, NumWordsPerNodeInstance, NumWordsPerLinkLabel, NumWordsPerTitle, NumWordsPerHostName, LocalGlobalLink
Remaining labels: NodeSelectivity, TableSelectivity, Web Schema, Fan-In
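
To make the parameter list concrete, here is a minimal Python sketch of a configuration object. The field names mirror the count-style parameters on the slide, but the `WedagenConfig` class, the integer types, and every default value are assumptions chosen for illustration; the slides give no actual values. LocalGlobalLink, NodeSelectivity and TableSelectivity are left out because the transcript does not say how they are expressed.

```python
from dataclasses import dataclass


@dataclass
class WedagenConfig:
    # Names follow the slide; all defaults are made-up placeholders.
    num_tuples: int = 1000
    num_source_node_instances: int = 10
    fan_out: int = 3
    num_keywords_per_node_instance: int = 5
    num_words_per_node_instance: int = 200
    num_words_per_link_label: int = 2
    num_words_per_host_name: int = 2
    num_words_per_title: int = 4

    def validate(self) -> None:
        """Reject non-positive counts (an assumed sanity rule)."""
        for name, value in vars(self).items():
            if value < 1:
                raise ValueError(f"{name} must be a positive integer, got {value}")


# Override only what differs from the placeholder defaults.
config = WedagenConfig(num_tuples=5000, fan_out=4)
config.validate()
```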

13 Parameter Values Suggestion
[Flowchart: steps are listed in extraction order; the branch targets are not recoverable from the transcript.]
- Start
- Generate specific parameter values
- User changes specific parameters
- Calculate max. no. of tuples to be generated
- Decision: is calculated value > NumTuples?
- Calculate NumSourceNodeInstances to generate the specified number of tuples
- Store suggested values in file
- User changes specific parameters
- End
- Invoke instance generation module
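
The two calculations named in this flowchart can be sketched as follows. This is a rough Python illustration under a strong assumption that is not stated in the slides: a linear schema in which every node instance has exactly `fan_out` outgoing links, so each source node instance yields `fan_out ** num_links` tuples. The function names are hypothetical, and WEDAGEN's real formulas may differ.

```python
def max_tuples(num_source_node_instances: int, fan_out: int, num_links: int) -> int:
    """Upper bound on tuples for a linear schema with uniform fan-out (assumed model)."""
    return num_source_node_instances * fan_out ** num_links


def suggest_num_source_node_instances(num_tuples: int, fan_out: int, num_links: int) -> int:
    """Smallest source-instance count whose bound reaches num_tuples (assumed model)."""
    tuples_per_source = fan_out ** num_links
    # Integer ceiling division, avoiding floating-point rounding.
    return -(-num_tuples // tuples_per_source)


bound = max_tuples(num_source_node_instances=10, fan_out=3, num_links=2)
suggested = suggest_num_source_node_instances(num_tuples=100, fan_out=3, num_links=2)
print(bound, suggested)
```

Under this model, comparing `bound` with NumTuples corresponds to the decision box, and `suggested` is the value that would be stored as the suggested NumSourceNodeInstances.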

14 Instance Generation Module (IGM)
[Diagram: five generators with their inputs and outputs; the exact wiring is not recoverable from the transcript.]
Generators:
1. No. of node instance generator
2. URL generator
3. Node instance attribute generator
4. Link set generator
5. Web page generator
Data labels in the diagram: Num Source Node Instances, Fan out, No. of Node Instances per node, Num words per URL, URLs of all node instances, Link set of each instance, Node attributes (e.g. title, text, date), Num words per node instance, Num words per title, Images, web page
Outputs: Node Pool, Web pages, Web tables (passed to the Tuple Extraction Module)
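
A toy version of the first, second and fourth generators might look like the Python sketch below. It assumes a linear schema with uniform fan-out, a tiny made-up vocabulary, and a hypothetical `.example` URL pattern; none of these details come from the slides, and the real IGM also generates node attributes and web pages.

```python
import random

WORDS = ["alpha", "beta", "gamma", "delta", "omega"]  # hypothetical vocabulary


def generate_instances(num_source, fan_out, num_levels, words_per_host=2, seed=0):
    """Return (nodes, links) for a linear schema with uniform fan-out (assumed model)."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    nodes = {}                 # node id -> {"level": int, "url": str}
    links = []                 # (source id, target id) pairs

    def new_node(level):
        node_id = len(nodes)
        host = "".join(rng.choice(WORDS) for _ in range(words_per_host))
        nodes[node_id] = {
            "level": level,
            "url": f"http://www.{host}.example/{node_id}.html",
        }
        return node_id

    # Level 0 holds the source node instances.
    frontier = [new_node(0) for _ in range(num_source)]
    # Each later level gets fan_out children per node of the previous level.
    for level in range(1, num_levels):
        next_frontier = []
        for src in frontier:
            for _ in range(fan_out):
                dst = new_node(level)
                links.append((src, dst))
                next_frontier.append(dst)
        frontier = next_frontier
    return nodes, links


nodes, links = generate_instances(num_source=2, fan_out=3, num_levels=3)
print(len(nodes), "node instances,", len(links), "link instances")
```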

15 Directed Graph Output from IGM

16 Tuple Extraction Module (TEM)
- IGM generates all node and link instances, interconnected as directed graph(s)
- TEM extracts and constructs individual web tuples from the directed graph(s)
- Node and link instances have IDs assigned
- Web tuples are stored in a web table file
- The result is a web table complete with node, link and tuple information
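
For a linear schema, extracting tuples from the directed graph can be pictured as enumerating every path from a source node instance to a node with no outgoing links. The Python sketch below makes that assumption explicitly and also assumes the graph is acyclic; the function name is hypothetical, and the slides do not describe how TEM handles tree-shaped schemas, where a tuple is a subgraph rather than a single path.

```python
def extract_tuples(links, sources):
    """Enumerate source-to-leaf paths of an acyclic directed graph as tuples of node IDs."""
    children = {}
    for src, dst in links:
        children.setdefault(src, []).append(dst)

    tuples = []

    def walk(path):
        kids = children.get(path[-1], [])
        if not kids:
            # A node without outgoing links ends the tuple.
            tuples.append(tuple(path))
            return
        for kid in kids:
            walk(path + [kid])

    for source in sources:
        walk([source])
    return tuples


# Small hand-built graph: 0 -> 1 -> 3 and 0 -> 2.
example_links = [(0, 1), (0, 2), (1, 3)]
print(extract_tuples(example_links, sources=[0]))
```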

17 Extracted Web Tuples

18 Preliminary Evaluation
- Elapsed time used to measure overhead of web table generation
- A set of sample test configurations identified, consisting of typical combinations of 4 web schemas and input parameters
- Performance measured with respect to:
  - Complexity of schema
  - Total number of node instances and total number of tuples
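
Measuring elapsed time against table size could be set up along these lines. This is only a Python sketch: `build_table` is a stand-in workload invented for the example, not WEDAGEN's generator, and the sizes are arbitrary.

```python
import time


def timed(func, *args):
    """Run func(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def build_table(num_tuples):
    # Stand-in workload; a real run would invoke the generator under test.
    return [(i, i + 1) for i in range(num_tuples)]


for size in (1_000, 10_000, 100_000):
    table, seconds = timed(build_table, size)
    print(f"{size:>7} tuples: {seconds:.6f} s")
```

Repeating each measurement several times and reporting the median would reduce noise from other processes.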

19 Four Test Schemas

20 Three Table Sizes

21 Elapsed Time Vs No. of Tuples

22 Experimental Findings
- Time elapsed in generating a web table increases with the size of the table
- Rate of growth is different for different schemas; i.e., schema complexity affects elapsed time
  - Generating a table of tree schema (schema 2) takes longer than that of linear schema (schema 1)
  - Generating a table of schema 2 takes longer than that of schema 4

23 Summary
- Parameters for creating web data of different sizes and complexities were successfully identified
- WEDAGEN was designed, implemented and successfully integrated into the WHOWEDA system
- WEDAGEN scales well with increasing web schema complexity and web table size
- Time and effort required to evaluate web database system performance can be reduced with WEDAGEN

24 Future Work
- Inclusion of more parameters:
  - Minimum and maximum depth of a tuple
  - Average ratio of bound and unbound nodes in a tuple
- Apply WEDAGEN to other database systems similar to WHOWEDA
- Develop WHOWEDA into a full-fledged benchmark toolkit

