Fast Webpage classification using URL features Authors: Min-Yen Kan Hoang and Oanh Nguyen Thi Conference: ICIKM 2005 Reporter: Yi-Ren Yeh.

Slides:



Advertisements
Similar presentations
Query Classification Using Asymmetrical Learning Zheng Zhu Birkbeck College, University of London.
Advertisements

Learning to Suggest: A Machine Learning Framework for Ranking Query Suggestions Date: 2013/02/18 Author: Umut Ozertem, Olivier Chapelle, Pinar Donmez,
Arnd Christian König Venkatesh Ganti Rares Vernica Microsoft Research Entity Categorization Over Large Document Collections.
Mustafa Cayci INFS 795 An Evaluation on Feature Selection for Text Clustering.
A Paper on RANDOM SAMPLING OVER JOINS by SURAJIT CHAUDHARI RAJEEV MOTWANI VIVEK NARASAYYA PRESENTED BY, JEEVAN KUMAR GOGINENI SARANYA GOTTIPATI.
Huffman code and ID3 Prof. Sin-Min Lee Department of Computer Science.
Web Mining Research: A Survey Authors: Raymond Kosala & Hendrik Blockeel Presenter: Ryan Patterson April 23rd 2014 CS332 Data Mining pg 01.
DOMAIN DEPENDENT QUERY REFORMULATION FOR WEB SEARCH Date : 2013/06/17 Author : Van Dang, Giridhar Kumaran, Adam Troy Source : CIKM’12 Advisor : Dr. Jia-Ling.
Jean-Eudes Ranvier 17/05/2015Planet Data - Madrid Trustworthiness assessment (on web pages) Task 3.3.
Query Dependent Pseudo-Relevance Feedback based on Wikipedia SIGIR ‘09 Advisor: Dr. Koh Jia-Ling Speaker: Lin, Yi-Jhen Date: 2010/01/24 1.
Web Page Categorization without the Web Page Author: Min-Yen Kan WWW-2004.
Learning to Advertise. Introduction Advertising on the Internet = $$$ –Especially search advertising and web page advertising Problem: –Selecting ads.
Information Extraction from HTML: General Machine Learning Approach Using SRV.
Scalable Text Mining with Sparse Generative Models
Online Stacked Graphical Learning Zhenzhen Kou +, Vitor R. Carvalho *, and William W. Cohen + Machine Learning Department + / Language Technologies Institute.
Graph Classification.
Finding Advertising Keywords on Web Pages Scott Wen-tau YihJoshua Goodman Microsoft Research Vitor R. Carvalho Carnegie Mellon University.
Keyphrase Extraction in Scientific Documents Thuy Dung Nguyen and Min-Yen Kan School of Computing National University of Singapore Slides available at.
WebPage Summarization Using Clickthrough Data JianTao Sun & Yuchang Lu, TsingHua University, China Dou Shen & Qiang Yang, HK University of Science & Technology.
DETECTING NEAR-DUPLICATES FOR WEB CRAWLING Authors: Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma Presentation By: Fernando Arreola.
Web Usage Mining with Semantic Analysis Date: 2013/12/18 Author: Laura Hollink, Peter Mika, Roi Blanco Source: WWW’13 Advisor: Jia-Ling Koh Speaker: Pei-Hao.
An Effective Fuzzy Clustering Algorithm for Web Document Classification: A Case Study in Cultural Content Mining Nils Murrugarra.
1 A study on automatically extracted keywords in text categorization Authors:Anette Hulth and Be´ata B. Megyesi From:ACL 2006 Reporter: 陳永祥 Date:2007/10/16.
Iterative Readability Computation for Domain-Specific Resources By Jin Zhao and Min-Yen Kan 11/06/2010.
Web Page Language Identification Based on URLs Reporter: 鄭志欣 Advisor: Hsing-Kuo Pao 1.
Review of the web page classification approaches and applications Luu-Ngoc Do Quang-Nhat Vo.
A Simple Unsupervised Query Categorizer for Web Search Engines Prashant Ullegaddi and Vasudeva Varma Search and Information Extraction Lab Language Technologies.
Detecting Semantic Cloaking on the Web Baoning Wu and Brian D. Davison Lehigh University, USA WWW 2006.
When Experts Agree: Using Non-Affiliated Experts To Rank Popular Topics Meital Aizen.
« Performance of Compressed Inverted List Caching in Search Engines » Proceedings of the International World Wide Web Conference Commitee, Beijing 2008)
Features and Algorithms Paper by: XIAOGUANG QI and BRIAN D. DAVISON Presentation by: Jason Bender.
윤언근 DataMining lab.  The Web has grown exponentially in size but this growth has not been isolated to good-quality pages.  spamming and.
Date: 2013/8/27 Author: Shinya Tanaka, Adam Jatowt, Makoto P. Kato, Katsumi Tanaka Source: WSDM’13 Advisor: Jia-ling Koh Speaker: Chen-Yu Huang Estimating.
A Language Independent Method for Question Classification COLING 2004.
Pseudo-supervised Clustering for Text Documents Marco Maggini, Leonardo Rigutini, Marco Turchi Dipartimento di Ingegneria dell’Informazione Università.
Search Engines. Search Strategies Define the search topic(s) and break it down into its component parts What terms, words or phrases do you use to describe.
Binxing Jiao et. al (SIGIR ’10) Presenter : Lin, Yi-Jhen Advisor: Dr. Koh. Jia-ling Date: 2011/4/25 VISUAL SUMMARIZATION OF WEB PAGES.
Enhancing Cluster Labeling Using Wikipedia David Carmel, Haggai Roitman, Naama Zwerdling IBM Research Lab (SIGIR’09) Date: 11/09/2009 Speaker: Cho, Chin.
Exploiting Wikipedia Categorization for Predicting Age and Gender of Blog Authors K Santosh Aditya Joshi Manish Gupta Vasudeva Varma
21/11/20151Gianluca Demartini Ranking Clusters for Web Search Gianluca Demartini Paul–Alexandru Chirita Ingo Brunkhorst Wolfgang Nejdl L3S Info Lunch Hannover,
Chapter 23: Probabilistic Language Models April 13, 2004.
Algorithmic Detection of Semantic Similarity WWW 2005.
A Repetition Based Measure for Verification of Text Collections and for Text Categorization Dmitry V.Khmelev Department of Mathematics, University of Toronto.
School of Computer Science 1 Information Extraction with HMM Structures Learned by Stochastic Optimization Dayne Freitag and Andrew McCallum Presented.
Comparative Experiments on Sentiment Classification for Online Product Reviews Hang Cui, Vibhu Mittal, and Mayur Datar AAAI 2006.
Mining Dependency Relations for Query Expansion in Passage Retrieval Renxu Sun, Chai-Huat Ong, Tat-Seng Chua National University of Singapore SIGIR2006.
Post-Ranking query suggestion by diversifying search Chao Wang.
Bloom Cookies: Web Search Personalization without User Tracking Authors: Nitesh Mor, Oriana Riva, Suman Nath, and John Kubiatowicz Presented by Ben Summers.
Date: 2012/11/29 Author: Chen Wang, Keping Bi, Yunhua Hu, Hang Li, Guihong Cao Source: WSDM’12 Advisor: Jia-ling, Koh Speaker: Shun-Chen, Cheng.
1 Adaptive Subjective Triggers for Opinionated Document Retrieval (WSDM 09’) Kazuhiro Seki, Kuniaki Uehara Date: 11/02/09 Speaker: Hsu, Yu-Wen Advisor:
Query Suggestions in the Absence of Query Logs Sumit Bhatia, Debapriyo Majumdar,Prasenjit Mitra SIGIR’11, July 24–28, 2011, Beijing, China.
26/01/20161Gianluca Demartini Ranking Categories for Faceted Search Gianluca Demartini L3S Research Seminars Hannover, 09 June 2006.
Using Wikipedia for Hierarchical Finer Categorization of Named Entities Aasish Pappu Language Technologies Institute Carnegie Mellon University PACLIC.
Information Extraction Entity Extraction: Statistical Methods Sunita Sarawagi.
Virtual Examples for Text Classification with Support Vector Machines Manabu Sassano Proceedings of the 2003 Conference on Emprical Methods in Natural.
KAIST TS & IS Lab. CS710 Know your Neighbors: Web Spam Detection using the Web Topology SIGIR 2007, Carlos Castillo et al., Yahoo! 이 승 민.
A Supervised Machine Learning Algorithm for Research Articles Leonidas Akritidis, Panayiotis Bozanis Dept. of Computer & Communication Engineering, University.
2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT) Hierarchical Cost-sensitive Web Resource Acquisition.
A Framework for Detection and Measurement of Phishing Attacks Reporter: Li, Fong Ruei National Taiwan University of Science and Technology 2/25/2016 Slide.
1 ICASSP Paper Survey Presenter: Chen Yi-Ting. 2 Improved Spoken Document Retrieval With Dynamic Key Term Lexicon and Probabilistic Latent Semantic Analysis.
 Effective Multi-Label Active Learning for Text Classification Bishan yang, Juan-Tao Sun, Tengjiao Wang, Zheng Chen KDD’ 09 Supervisor: Koh Jia-Ling Presenter:
Predicting Short-Term Interests Using Activity-Based Search Context CIKM’10 Advisor: Jia Ling, Koh Speaker: Yu Cheng, Hsieh.
Spamming Botnets: Signatures and Characteristics Yinglian Xie, Fang Yu, Kannan Achan, Rina Panigrahy, Microsoft Research, Silicon Valley Geoff Hulten,
To Personalize or Not to Personalize: Modeling Queries with Variation in User Intent Presented by Jaime Teevan, Susan T. Dumais, Daniel J. Liebling Microsoft.
Usefulness of Quality Click- through Data for Training Craig Macdonald, ladh Ounis Department of Computing Science University of Glasgow, Scotland, UK.
How to Evaluate the Effectiveness of URL Normalizations Snag Ho Lee, Sung Jin Kim, Hyo Sook Jeong in Proceedings of the Third International Conference.
© NCSR, Frascati, July 18-19, 2002 CROSSMARC big picture Domain-specific Web sites Domain-specific Spidering Domain Ontology XHTML pages WEB Focused Crawling.
Learning to Detect and Classify Malicious Executables in the Wild by J
Authors: Wai Lam and Kon Fan Low Announcer: Kyu-Baek Hwang
Date: 2012/11/15 Author: Jin Young Kim, Kevyn Collins-Thompson,
Presentation transcript:

Fast Webpage classification using URL features Authors: Min-Yen Kan Hoang and Oanh Nguyen Thi Conference: ICIKM 2005 Reporter: Yi-Ren Yeh

Outline Introduction URL Feature Extraction –Recursive segmentation –Using URL feature classes Experimental results Conclusion

Introduction A web page's uniform resource locator (URL) is the least expensive to obtain One of the more informative sources with respect to classification The authors approach webpage classification only by using the URLs –Feature extraction from URL –Apply machine learning algorithms

URL Baseline Segmentation Segment URL at non-alphanumeric characters and at URI-escaped entities (e.g., '%20') to create smaller tokens Baseline segmentation is straightforward to implement and typically results in 4-7 tokens example

Recursive Segmentation Concatenated words (e.g., activatealert) are especially prevalent in website domain names Segmenting these tokens into its component words is likely to increase performance This paper performs the segmentation by information content (entropy) reduction additionally A token T can be split into n partitions if where ti denotes the ith partition of T

Recursive Segmentation A partitioning that has lower entropy than others would be a more probable parse of the token Such entropies can be estimated by collecting the frequencies of tokens in a large corpus Applying a tree partition strategy (O(n log n)) to replace all the 2^(T-1) partitions example

URI Components and Length features First spilt the URL via URI protocol scheme :// host / path / document. extension ? query # fragment A token that occurs in different parts of URLs may contribute differently to classification The authors feature set by qualifying them with their components The absence of certain components can influence classification as well The absence of certain components also can influence classification as well example

Orthographic Features Using the surface form of a token also presents challenges for generalization –e.g vs Add features for tokens with capitalized letters and/or numbers that differentiate these tokens by their length These features are added both in a general, URL-wide feature as well as ones that are URI component-specific

Sequential Features N-grams token might also help in classification –The authors use 2, 3, and 4-grams Sequential order among tokens also matters –“web spider” and “spider web” –consider model left-to-right precedence between tokens example

Evaluate on Multi-class Classification Employ a subset of the WebKB, containing 4,167 pages Four classes ( student, faculty, course and project ) Use SVM and Maximum entropy classification method Marco F measure is used

Results on WebKB

Evaluate On Hierarchical Categorization Evaluate on the Open Directory Project The snapshot dated 3 August 2004, which encompasses over 4.4 M URLs categorized into 17 first-level and 508 second-level categories The authors use 100,000 randomly chosen ODP URLs to assemble a testing (and training) corpus for the two-level, hierarchical experiments Only 360 second-level categories are used.

Results on ODP

Conclusion The authors have extended previous work and added features to model URL component length, content, orthography, token sequence and precedence Also evaluate the use of these features over a large set of tasks including relevance, categorization and Pagerank prediction. These features do not perform as well with typical web site entry points (i.e., just the domain name), as they attempt to leverage the internal path structure of the URL.

scheme :// host / path / document. extension ? query # fragment