Holistic Web Page Classification
William W. Cohen
Center for Automated Learning and Discovery (CALD), Carnegie Mellon University

Outline
- Web page classification: assign a label from a fixed set (e.g., "pressRelease", "other") to a page.
- This talk: page classification as information extraction.
  – Why would anyone want to do that?
- Overview of information extraction
  – Site-local, format-driven information extraction as recognizing structure
- How recognizing structure can aid in page classification

foodscience.com-Job2
JobTitle: Ice Cream Guru
Employer: foodscience.com
JobCategory: Travel/Hospitality
JobFunction: Food Services
JobLocation: FL-Deerfield Beach
ContactInfo:
DateExtracted: January 8, 2001
Source:
OtherCompanyJobs: foodscience.com-Job1

Two flavors of information extraction systems
- Information extraction task 1: extract all data from 10 different sites.
  – Technique: write 10 different systems, each driven by formatting information from a single site (site-dependent extraction).
- Information extraction task 2: extract most data from 50,000 different sites.
  – Technique: write one site-independent system.

Extracting from one web site:
- Use site-specific formatting information, e.g., "the JobTitle is a bold-faced paragraph in column 2".
- For large, well-structured sites, this is like parsing a formal language.
Extracting from many web sites:
- Need general solutions to entity extraction, grouping into records, etc.
- Primarily use content information.
- Must deal with a wide range of ways that users present data.
- Analogous to parsing natural language.
The problems are complementary:
- Site-dependent learning can collect training data for, and boost the accuracy of, a site-independent learner.
(A rough sketch of a site-specific rule follows.)
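A minimal sketch of what such a site-specific rule might look like, assuming pages are parsed with BeautifulSoup; the layout (job title in bold in the second table column) and the function name are hypothetical, for illustration only:

# Hypothetical site-dependent extraction rule for one site's layout:
# "the JobTitle is a bold-faced paragraph in column 2".
from bs4 import BeautifulSoup

def extract_job_title(html):
    soup = BeautifulSoup(html, "html.parser")
    for row in soup.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) >= 2:
            bold = cells[1].find("b")  # bold text in the second column
            if bold is not None:
                return bold.get_text(strip=True)
    return None  # rule did not fire on this page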

An architecture for site-local learning
- Engineer a number of "builders":
  – Each infers a "structure" (e.g., a list, a table column) from a few positive examples of that structure.
  – A "structure" extracts all its members: f(page) = { x : x is a "structure element" on page }.
- A master learning algorithm coordinates the use of the "builders".
- Add/remove "builders" to optimize performance on a domain.
  – See (Cohen, Hurst & Jensen, WWW-2002).
(A sketch of the builder/master-learner contract follows.)
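The slide implies a simple contract between builders and the master learner. Here is a rough sketch under assumed names (Builder, infer, extract, and learn_structure are all hypothetical; this is not the actual API of the WWW-2002 system):

# Hypothetical builder contract; not the actual API of
# Cohen, Hurst & Jensen (WWW-2002).
from abc import ABC, abstractmethod

class Builder(ABC):
    @abstractmethod
    def infer(self, page, positives):
        """Infer a structure (list, table column, ...) from a few
        positive example elements on the page, or return None."""

    @abstractmethod
    def extract(self, structure, page):
        """Return all members of the structure:
        f(page) = {x : x is a structure element on page}."""

def learn_structure(builders, page, positives):
    # The master learner tries each builder and keeps a structure that
    # covers all positive examples; here, the most specific candidate
    # (fewest extracted elements) is preferred.
    candidates = [(b, s) for b in builders
                  if (s := b.infer(page, positives)) is not None
                  and set(positives) <= set(b.extract(s, page))]
    return min(candidates,
               key=lambda bs: len(bs[0].extract(bs[1], page)),
               default=None)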

[Figure: Builder]

Experimental results: most "structures" need only 2-3 examples for recognition.
[Figure: examples needed for 100% accuracy]

Experimental results: 2-3 examples lead to high average accuracy.
[Figure: F1 vs. number of examples]

Why learning from few examples is important
At training time, only four examples are available, but one would like to generalize to future pages as well.

Outline
- Overview of information extraction
  – Site-local, format-driven information extraction as recognizing structure
- How recognizing structure can aid in page classification
  – Page classification: assign a label from a fixed set (e.g., "pressRelease", "other") to a page.

Previous work: exploit hyperlinks (Slattery & Mitchell, 2000; Cohn & Hofmann, 2001; Joachims, 2001). Documents pointed to by the same "hub" should have the same class.
This work: use the structure of hub pages (as well as the structure of the site graph) to find better "hubs".
The task: classifying "executive bio pages".

Background: "co-training" (Blum and Mitchell, '98)
Suppose examples are of the form (x1, x2, y), where x1 and x2 are independent given y, each xi is sufficient for classification, and unlabeled examples are cheap.
- E.g., x1 = bag of words, x2 = bag of links.
Co-training algorithm:
1. Use x1's (on labeled data D) to train f1(x1) = y.
2. Use f1 to label additional unlabeled examples U.
3. Use x2's (on the labeled part of U, plus D) to train f2(x2) = y.
4. Repeat...
A compact sketch of this loop appears below.
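The sketch assumes scikit-learn-style classifiers with fit/predict_proba; the round count, batch size, and confidence threshold are illustrative choices, not values from the talk:

import numpy as np

def co_train(f1, f2, X1, X2, y, U1, U2, rounds=10, batch=5, threshold=0.95):
    # X1/X2: the two views of the labeled data; U1/U2: views of unlabeled data.
    for _ in range(rounds):
        f1.fit(X1, y)                          # 1. train f1 on view x1
        if len(U1) == 0:
            break
        p = f1.predict_proba(U1)               # 2. label unlabeled examples
        conf = p.max(axis=1)
        keep = np.argsort(-conf)[:batch]       # most confident predictions...
        keep = keep[conf[keep] >= threshold]   # ...above the threshold
        if len(keep) == 0:
            break
        y_new = f1.classes_[p[keep].argmax(axis=1)]
        X1 = np.vstack([X1, U1[keep]])
        X2 = np.vstack([X2, U2[keep]])
        y = np.concatenate([y, y_new])
        mask = np.ones(len(U1), dtype=bool)
        mask[keep] = False
        U1, U2 = U1[mask], U2[mask]
        f2.fit(X2, y)                          # 3. train f2 on view x2
    return f1, f2                              # 4. the repeat is the loop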

1-step co-training for web pages
f1 is a bag-of-words page classifier, and S is a web site containing unlabeled pages.
1. Feature construction: represent a page x in S as a bag of the pages that link to x (a "bag of hubs").
2. Learning: learn f2 from the bag-of-hubs examples, labeled with f1.
3. Labeling: use f2(x) to label pages from S.
(A sketch of step 1 follows.)
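A sketch of step 1, assuming the site graph is available as a dict mapping each page URL to the set of URLs it links to (this representation is an assumption, not from the talk):

def bag_of_hubs(site_graph):
    # site_graph: {page_url: set of link targets on that page}
    boh = {x: set() for x in site_graph}
    for hub, targets in site_graph.items():
        for x in targets:
            if x in boh:
                boh[x].add(hub)  # represent x by the pages that link to it
    return boh  # {page_url: bag (set) of hub pages}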

Improving the "bag of hubs" representation
Assumptions:
- Index pages (of the kind shown) are common.
- "Builders" can recognize index structures from a few positive examples (true positive examples can be extrapolated to the entire index list, with some builder).
- A global bag-of-words page classifier will be moderately accurate, but it is useful to "smooth" its predictions so that they are consistent with some index page(s).

Improved 1-step co-training for web pages
1. Anchor labeling: label an anchor a in S positive iff it points to a positive page x (according to f1).
2. Feature construction:
   - Let D be the set of all (x', a) such that a is a positive anchor on page x'. Generate many small training sets D_i from D by sliding small windows over D.
   - Let P be the set of all "structures" found by any builder from any subset D_i.
   - Say that p links to x if p extracts an anchor that points to x. Represent a page x as the bag of structures in P that link to x.
3. Learning and labeling: as before.
(A sketch of this feature construction follows.)
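A sketch of this feature construction, reusing the hypothetical Builder contract from above; it assumes pages expose .anchors and .url and anchors expose .target, and the window size is illustrative:

def improved_features(pages, builders, f1, window=3):
    # Anchor labeling: an anchor is positive iff f1 labels its target positive.
    P = []
    for page in pages:
        pos = [a for a in page.anchors if f1(a.target)]
        if not pos:
            continue
        # Slide a small window over the positive anchors: each window is a D_i.
        for i in range(max(len(pos) - window + 1, 1)):
            Di = pos[i:i + window]
            for b in builders:
                s = b.infer(page, Di)  # structure found from subset D_i
                if s is not None:
                    P.append((b, s, page))
    def features(x):
        # Represent page x as the bag of structures in P that link to x.
        return {j for j, (b, s, page) in enumerate(P)
                if any(a.target == x.url for a in b.extract(s, page))}
    return features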

[Figures: three builders, each extracting an index structure (List1, List2, List3) from the site's hub pages]

BOH representation fed to the learner:
{ List1, List3, … }, PR
{ List1, List2, List3, … }, PR
{ List2, List3, … }, Other
{ List2, List3, … }, PR
…

Experimental results
[Figure: baseline results, annotated "co-training hurts" and "no improvement"]

Experimental results
[Figure]

Concluding remarks
- "Builders" (from a site-local extraction system) let one discover and use the structure of web sites and index pages to smooth page-classification results.
- Discovering good "hub structures" makes it possible to use 1-step co-training on small (… example) unlabeled datasets.
  – Average error rate was reduced from 8.4% to 3.6%.
  – The difference is statistically significant under a 2-tailed paired sign test or t-test (a sketch of such a test follows).
  – EM with probabilistic learners also works; see (Blei et al., UAI 2002).
- Details to appear in (Cohen, NIPS 2002).
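For reference, a 2-tailed paired sign test over per-site wins and losses can be run with scipy; the counts below are hypothetical placeholders, not numbers from the paper:

from scipy.stats import binomtest

wins, losses = 14, 3  # hypothetical per-site win/loss counts
result = binomtest(wins, n=wins + losses, p=0.5, alternative="two-sided")
print(f"sign test p-value = {result.pvalue:.4f}")  # small p => significant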