Automatically Extracting Data Records from Web Pages Presenter: Dheerendranath Mundluru


1 Automatically Extracting Data Records from Web Pages
Presenter: Dheerendranath Mundluru (dnm8925@cacs.louisiana.edu, http://www.ucs.louisiana.edu/~dnm8925)
Dheerendranath Mundluru, Dr. Vijay Raghavan, Dr. Zonghuan Wu, Jayasimha R. Katukuri, Saygin Celebi
Laboratory for Internet Computing, Center for Advanced Computer Studies, University of Louisiana at Lafayette, Lafayette, LA

2 Agenda
- Introduction
- Proposed Solution: Path-based Information Extractor
- Experiments
- Conclusions and Future Work

3 Introduction
World Wide Web: the largest known repository of documents, containing diverse content used by people from diverse backgrounds.
A few characteristics of the Web:
- Huge size
- Easily accessible
- Hyperlinked
- Dynamic
- Diverse coverage: science, politics, education, etc.
- Increasing at a tremendous rate
- Noisy: advertisements, mirror sites, etc.

4 Web Mining: Leverage the Value of the Web
Web mining aims to discover useful knowledge from the Web.
Characteristics of the Web such as heterogeneity, increasing size, and noise make Web mining a challenging task.
Web mining can be classified into [Kosala 00, Liu 04]:
- Web content mining: extracting and discovering useful information or knowledge from Web page contents
- Web structure mining: discovering useful knowledge from the structure of hyperlinks, e.g., as used by Google
- Web usage mining: discovering useful knowledge from user access log files, e.g., as used by Amazon.com
Web mining is a multidisciplinary field:
- Data mining, information retrieval, databases, machine learning, information extraction, natural language processing, etc.

5 Web Mining & Web Content Mining Classification

6 Structured Data Extraction
Structured data extraction deals with extracting information displayed in a regular structure, since such information is perceived to represent the essential content of a Web page, e.g., the list of products in an e-commerce Web page. [Liu 04]
A few example applications:
- Online comparative shopping engines (e.g., nextag.com)
- Metasearch engines (e.g., dogpile.com)
- Modern business intelligence systems (e.g., intelliseek.com)

7 Sample response page from Google

8 Sample response page from drugstore.com

9 Path-based Information Extractor (PIE)
PIE is a data extraction system that automatically extracts the data records present in Web search response pages. [Mundluru 05a, Mundluru 05b]
PIE also eliminates "noisy" content such as advertisements and navigation links.

10 A Few Observations
- Observation 1: Data records displayed in a particular region of a Web page are contiguous and are formatted using similar HTML tags. [Liu 03]
- Observation 2: A group of similar data records belonging to a particular region is always present under the same parent node in the tag tree. [Liu 03]
- Observation 3: Every record in most search response pages has at least one hyperlink. Usually, the title of the retrieved document is displayed as a hyperlink that points to the retrieved document. In this work, we refer to such a hyperlink as a record link.
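The three observations can be illustrated with a small sketch. This is not the PIE algorithm, whose details are not given in this transcript; it is a simplified heuristic that groups hyperlinks by their root-to-link tag path and treats the largest group as candidate record links. The names `LinkPathCollector`, `candidate_record_links`, and `SAMPLE_PAGE`, as well as the sample HTML itself, are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Tags that never have a closing tag; they must not stay on the open-tag stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class LinkPathCollector(HTMLParser):
    """Collects every hyperlink together with its root-to-link tag path."""

    def __init__(self):
        super().__init__()
        self.stack = []                          # currently open tags
        self.links_by_path = defaultdict(list)   # tag path -> list of hrefs

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append(tag)
        href = dict(attrs).get("href")
        if tag == "a" and href:
            # Links formatted with the same tags under the same kind of
            # parent end up with the same path (Observations 1 and 2).
            self.links_by_path["/".join(self.stack)].append(href)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def candidate_record_links(html):
    """Return the hrefs of the largest group of links sharing one tag path."""
    collector = LinkPathCollector()
    collector.feed(html)
    if not collector.links_by_path:
        return []
    return max(collector.links_by_path.values(), key=len)


# Hypothetical response page: three result records plus navigation and an ad.
SAMPLE_PAGE = """
<html><body>
  <div id="nav"><a href="/home">Home</a></div>
  <ul>
    <li><a href="/doc1">First result</a><p>snippet one</p></li>
    <li><a href="/doc2">Second result</a><p>snippet two</p></li>
    <li><a href="/doc3">Third result</a><p>snippet three</p></li>
  </ul>
  <div id="ad"><a href="/buy">Sponsored</a></div>
</body></html>
"""

record_links = candidate_record_links(SAMPLE_PAGE)
print(record_links)
```

On this sample, the three result links share the path `html/body/ul/li/a`, while the navigation and advertisement links fall into a smaller group, so only the result links are returned. A real response page would need a more careful notion of similarity than exact path equality.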

11 Record Extraction Algorithm

12 Experiments
Experiment setup:
- The proposed system was evaluated by comparing it with two state-of-the-art record extraction systems: MDR [Liu 03] and ViNTs [Zhao 05]
- All three systems were tested on a total of 60 Web pages (containing 873 data records) taken from 60 Web sources
- The 60 Web sources include:
  - general-purpose search engines, e.g., Google, Yahoo
  - e-commerce sites, e.g., drugstore.com, clevershoppers.com
  - other special-purpose search engines, e.g., mit.edu, breastcancer.org
- PIE was developed in Java

13 Experiments
Evaluation measures used:
- Recall = (total number of target data records correctly extracted) / (total number of target data records)
- Precision = (total number of target data records correctly extracted) / (total number of data records extracted)
Results:
            PIE     MDR     ViNTs
Recall      90.4%   69.9%   83.8%
Precision   95.5%   81.4%   93%
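The two measures can be computed as in the following minimal sketch. The function name `precision_recall` and the record identifiers are hypothetical; the counts are made up for illustration and are not taken from the experiments.

```python
def precision_recall(extracted, targets):
    """Return (precision, recall) for a set of extracted record ids.

    extracted: ids of the records the system returned
    targets:   ids of the records that should have been returned
    """
    extracted, targets = set(extracted), set(targets)
    if not extracted or not targets:
        raise ValueError("both sets must be non-empty")
    correct = len(extracted & targets)   # target records correctly extracted
    return correct / len(extracted), correct / len(targets)


# Hypothetical page with five target records; the system returns three of
# them plus one advertisement that it mistook for a record.
targets = {"r1", "r2", "r3", "r4", "r5"}
extracted = {"r1", "r2", "r3", "ad1"}

p, r = precision_recall(extracted, targets)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here three of the four extracted items are correct and three of the five target records are found, so precision penalizes the extracted advertisement while recall penalizes the two missed records.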

14 Conclusions & Future Work
Conclusions:
- Automatic data extraction is extremely important for systems such as online comparative shopping engines, metasearch engines, and business intelligence solutions.
- A system called PIE has been proposed for automatically extracting data records from Web pages.
- Experiments showed that PIE outperformed MDR and ViNTs, two state-of-the-art record extraction systems that are in use at two software companies, in both recall and precision.
Future work:
- Improving the effectiveness of record extraction
- Extracting the attributes of each data record, e.g., product name, price, etc.
- Performing large-scale experiments
- Building applications such as online comparative shopping engines, metasearch engines, etc.

15 References
[Mundluru 05a] D. Mundluru, J. Katukuri, and S. Celebi. Automatically Mining Result Records from Search Engine Response Pages. Proceedings of 5th IEEE International Conference on Data Mining (ICDM), 749-753, Houston, November 2005.
[Mundluru 05b] D. Mundluru, Z. Wu, V. Raghavan, J. Katukuri, and S. Celebi. Automatically Mining Search Result Records. Technical Report CACS-TR-2005-3-1, Center for Advanced Computer Studies, University of Louisiana at Lafayette, 2005.
[Kosala 00] R. Kosala and H. Blockeel. Web Mining Research: A Survey. ACM Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD), 2(1), 1-15, 2000.
[Liu 04] B. Liu and K. Chang. Editorial: Special Issue on Web Content Mining. SIGKDD Explorations, 6(2), 1-4, December 2004.
[Liu 03] B. Liu, R. Grossman, and Y. Zhai. Mining Data Records in Web Pages. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 601-606, Washington, D.C., August 2003.
[Zhao 05] H. Zhao, W. Meng, Z. Wu, V. Raghavan, and C. Yu. Fully Automatic Wrapper Generation for Search Engines. Proceedings of the 14th International World Wide Web Conference, 66-75, Chiba, Japan, May 2005.

