Page-level Template Detection via Isotonic Smoothing
Deepayan Chakrabarti (Yahoo! Research), Ravi Kumar (Yahoo! Research), Kunal Punera (Univ. of Texas at Austin)


Agenda
Motivation
  – Potential applications
  – Related work
Approach
  – Page-level template detection
  – Regularized isotonic regression
Some experimental results

Template Example:
  – Copyright message
  – Advertisements
  – Look and feel
  – Links for navigation

Applications
Web Ranking
  – Do not match query to text in templates
Duplicate Detection
  – Do not shingle text inside templates
Summarization
  – Do not use text within templates for summary

Site-level Template Detection
Templates = “page-fragments” that recur across several pages of a website
  – E.g.: copyright, navigation links
A page-fragment can be
  – HTML code (tags)
  – Visible text
  – DOM nodes (structure + text)
Simple two-pass algorithms
  – Hash page-fragments and count occurrences
  – Mark templates in second pass
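The two-pass idea above can be sketched roughly as follows. This is a minimal illustration, not the implementation from the talk: it assumes each page has already been split into fragment strings, and the names `fragment_hash`, `mark_site_templates` and the `min_pages` threshold are hypothetical.

```python
import hashlib
from collections import Counter

def fragment_hash(fragment):
    """Hash a page fragment after collapsing whitespace (an assumed normalization)."""
    normalized = " ".join(fragment.split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

def mark_site_templates(pages, min_pages=2):
    """pages: list of pages from one site, each a list of fragment strings.

    Returns one list of booleans per page; True marks a fragment as a template.
    """
    # Pass 1: count on how many pages each fragment hash occurs.
    counts = Counter()
    for page in pages:
        counts.update({fragment_hash(f) for f in page})
    # Pass 2: mark fragments that recur on at least min_pages pages.
    return [[counts[fragment_hash(f)] >= min_pages for f in page] for page in pages]

site = [
    ["Home | About", "Story one", "(c) Example Corp"],
    ["Home | About", "Story two", "(c) Example Corp"],
    ["Home | About", "Story three", "(c)  Example Corp"],
]
labels = mark_site_templates(site, min_pages=3)
```

Here the navigation bar and the copyright line are marked on every page, while the per-page story text is not. Note that the counts must be kept per site, which is the bookkeeping cost the next slide lists as an issue.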

Site-level Template Detection
Advantages:
  – No labeled training data needed
  – Very high precision
Issues:
  – Inefficient when pages are not processed in site order
      – E.g.: in a web crawler pipeline
      – Need to maintain hashes and counts for all sites
      – Marking site-level templates for new websites
  – Not all templates are site-level in nature
      – Low recall

To address these issues:
  – Only use page-level information
  – Learn a general model for templates

Page-level Model-based Template Detection
Problem: Template detection
  – Using only information local to a webpage
  – Detect all templates: not just site-level
  – No manually labeled training data
Our Approach:
  – Obtain training data via site-level approach
  – Learn a classification model for “templateness”
      – For each internal DOM node
  – Enforce a global monotonicity property of “templateness”

Automatically Labeling Data
Use site-level approach
  – 3,000 websites (200 webpages per site)
  – Obtained ~1M labeled DOM nodes
Labeled data has a bias
  – Some template DOM nodes labeled as non-templates
  – False negatives are noise
Extract general structural and content cues from the DOM nodes
  – Generalize over the site-level training data

Cues Used Implicitly by Humans
  – Placement on screen
  – Link density
  – Background color (BGColor)
  – Aspect ratio
  – Average sentence size
  – Fraction of text outside anchors

Learning the “Templateness” Classifier
Extract features of DOM nodes from cues
Learn weights for these features
  – 2-class problem
  – Logistic regression classifier
  – Simple classifier, avoids fitting noise in the data
  – “Templateness” = probability of belonging to template class
  – Separate classifiers learned for nodes of different sizes
Each node in the DOM tree is classified
  – Past work classifies segments of web pages
  – Segmentation might mix template and non-template content
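As a rough illustration of how a logistic regression model turns node features into a “templateness” probability, consider the sketch below. The feature names, the weights and the bias are made up for the example; in the talk the weights are learned from the site-level labels, and separate classifiers are trained for nodes of different sizes, which this sketch does not model.

```python
import math

# Hypothetical weights and bias, chosen by hand for illustration only.
WEIGHTS = {"link_density": 4.0, "text_outside_anchors": -3.0}
BIAS = 0.0

def templateness(features):
    """Logistic regression score: estimated probability that a DOM node is a template."""
    z = BIAS + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

# Two toy DOM nodes described by two of the cues from the previous slide.
nav_bar = {"link_density": 0.9, "text_outside_anchors": 0.1}
article = {"link_density": 0.05, "text_outside_anchors": 0.95}

nav_score = templateness(nav_bar)
article_score = templateness(article)
```

With these assumed weights the link-heavy node scores close to 1 and the text-heavy node close to 0. The scores are computed per node, in isolation, which is why a later smoothing step is needed.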

“Templateness” is a monotonic property
A DOM node is a template if and only if all its children are templates.

“Templateness” Monotonicity
Each node is classified in isolation
  – Classifier scores needn’t be monotonic
  – Classifier might misclassify nodes
Post-processing “templateness” scores
  – Enforces monotonicity
  – Corrects misclassifications by smoothing
“Templateness” scores are real numbers x_i
A node in the DOM tree is a template if and only if all its children are templates

Smoothing via Regularized Isotonic Regression
Given raw classification scores x_1,…,x_n for n nodes in the DOM tree, find smoothed scores y_1,…,y_n that
    minimize    ∑_i |x_i − y_i| + c · (# distinct y_i’s)
    subject to  i → j ⇒ y_i ≤ y_j    (monotonicity constraint)
  – ∑_i |x_i − y_i|: low L_1 distance from the raw scores
  – # distinct y_i’s: create as few “sections” as possible
  – c: “tradeoff” parameter
This results in:
  1. Smoothed monotonic scores y_i
  2. “Sections” of the webpage
Section = adjacent nodes with the same y_i value
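To make the objective concrete, the sketch below evaluates it for one candidate smoothing of a three-node tree. It assumes that i → j means "i is the parent of j", so a node's score may not exceed its children's scores; the slide does not spell this out. The helper names and the `parent` array encoding are hypothetical.

```python
def is_monotone(parent, y):
    """True if every node's score is at least its parent's score."""
    return all(p is None or y[p] <= y[i] for i, p in enumerate(parent))

def objective(x, y, c):
    """L1 distance to the raw scores plus c times the number of distinct smoothed scores."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y)) + c * len(set(y))

# Node 0 is the root; nodes 1 and 2 are its children.
parent = [None, 0, 0]
x = [0.8, 0.9, 0.3]   # raw classifier scores: the root outscores child 2
y = [0.3, 0.9, 0.3]   # one candidate smoothing that restores monotonicity

raw_ok = is_monotone(parent, x)
smoothed_ok = is_monotone(parent, y)
cost = objective(x, y, c=0.1)
```

The raw scores violate the constraint because the root (0.8) scores higher than one of its children (0.3). The candidate `y` pays an L_1 cost of about 0.5 for lowering the root, plus 2c for its two distinct values.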

Smoothing via Regularized Isotonic Regression
Lemma: For L_1 distance, each y_i must equal some x_j
The optimal solution is found using a dynamic program
  – Complexity: O(n^2 log n), n = # of DOM nodes in page
  – Equals complexity of algorithms for non-regularized isotonic regression (c = 0)
  – On a Pentium 4, 3GHz, 512MB running FreeBSD: around 60ms for cnn.com (n = 292)
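The slides do not give the dynamic program itself, so the following is only a sketch of how a tree DP for a closely related objective could look, not the authors' algorithm. It makes three assumptions: i → j is read as parent → child; the penalty counts sections, that is, one for the root plus one for every node whose score differs from its parent's; and, following the lemma, candidate values are restricted to the raw scores. The function name and the `children` encoding are hypothetical.

```python
def smooth_scores(children, x, c, root=0):
    """Minimize sum|x_i - y_i| + c * (#sections) subject to y_parent <= y_child.

    children[i] lists the children of node i. Returns (total_cost, y).
    """
    values = sorted(set(x))          # candidate scores, per the lemma
    k, n = len(values), len(x)
    # Breadth-first order: parents first (the list grows while we iterate over it).
    order = [root]
    for node in order:
        order.extend(children[node])
    same = [None] * n       # same[i][a]: best subtree cost when y_i = values[a]
    best_from = [None] * n  # best_from[i][a]: index b >= a minimizing same[i][b]
    for node in reversed(order):     # children are processed before parents
        cost = [abs(x[node] - v) for v in values]
        for ch in children[node]:
            for a in range(k):
                b = best_from[ch][a]
                # Child either keeps the parent's value or starts a new section.
                cost[a] += min(same[ch][a], c + same[ch][b])
        arg = [k - 1] * k
        for a in range(k - 2, -1, -1):
            arg[a] = a if cost[a] <= cost[arg[a + 1]] else arg[a + 1]
        same[node], best_from[node] = cost, arg
    # Recover the scores top-down.
    idx = [0] * n
    idx[root] = best_from[root][0]
    total = c + same[root][idx[root]]
    for node in order:
        a = idx[node]
        for ch in children[node]:
            b = best_from[ch][a]
            idx[ch] = a if same[ch][a] <= c + same[ch][b] else b
    return total, [values[i] for i in idx]

# A three-node chain: root -> node 1 -> node 2.
children = [[1], [2], []]
x = [0.1, 0.15, 0.9]
total, y = smooth_scores(children, x, c=0.1)
```

On this chain the raw scores are already monotone, so with c = 0 they are returned unchanged. With c = 0.1 the two low-scoring nodes are merged into one section, and with a very large c the whole chain collapses into a single section. The sketch does O(n · k) work for k distinct raw scores; it makes no attempt to match the complexity quoted on the slide.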

Page-level Model-based Template Detection
Problem: Template detection
Our Approach:
  – Obtain training data via site-level approach
  – Classification model for “templateness”
      – Designed features for DOM nodes
      – Each DOM node labeled by a logistic regression classifier
  – Enforce a global monotonicity property of “templateness”
      – Formulate it as regularized isotonic regression over trees
      – Optimal solution via dynamic program

Template Detection Accuracy
Data: manually classified DOM nodes in webpages
Results:
  – PageLevel system works very well
  – Smoothing improves classification accuracy
[Chart: f-measure for Classifier Only vs. Classifier + Smoothing]

Duplicate Detection
Data: 2359 pages from 3 lyrics websites
  – 1711 duplicate pairs (same song, different websites)
  – 2058 non-duplicate pairs (different songs, same website)
Errors occur when shingles hit template content
PageLevel detects more templates than SiteLevel
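The effect of template text on shingle-based duplicate detection can be illustrated with a toy example. The sketch assumes word-level 3-shingles compared by Jaccard similarity, which is one common setup and not necessarily the one used in the experiment; the token lists are invented.

```python
def shingles(tokens, w=3):
    """Set of all w-token windows in a token list."""
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Invented template text for two lyrics sites, and two invented songs.
site1 = "home charts artists login".split()
site2 = "lyrics portal top songs today".split()
song1 = "row row row your boat gently down the stream".split()
song2 = "twinkle twinkle little star how i wonder what you are".split()

# Same song on different sites: a true duplicate pair.
dup_raw = jaccard(shingles(site1 + song1), shingles(site2 + song1))
dup_clean = jaccard(shingles(song1), shingles(song1))

# Different songs on the same site: a true non-duplicate pair.
nondup_raw = jaccard(shingles(site1 + song1), shingles(site1 + song2))
nondup_clean = jaccard(shingles(song1), shingles(song2))
```

With template text included, the duplicate pair scores below 1 and the non-duplicate pair above 0, because shingles hit the differing or shared templates. After the template tokens are dropped, the two cases separate cleanly.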

Conclusions
Page-level model-based template detection
Used no manually labeled training data
“Templateness” monotonicity property
Regularized isotonic regression
  – Might be of independent interest
Showed empirically that PageLevel generalizes over the SiteLevel data