Web Page Language Identification Based on URLs
Reporter: 鄭志欣
Advisor: Hsing-Kuo Pao

Reference
- E. Baykan, M. Henzinger, and I. Weber. Web page language identification based on URLs. In 34th International Conference on Very Large Data Bases (VLDB), ACM, 2008.

Outline
- Introduction
- Language Identification Based On URLs
- Experimental Setup
- Experimental Results
- Conclusions

Introduction
- Given only the URL of a web page, can we identify its language?
  - Web crawlers
  - Personalized web browsers
- We consider the problem of determining the language of a web page using only its URL.
  - English, French, German, Spanish, and Italian
  - .com (60%), .org (10%)

Introduction
- Applying machine learning techniques
- Features
  - Word features
  - N-gram features
  - Custom-made features
- Machine learning algorithms
  - Naïve Bayes
  - Decision Tree
  - Relative Entropy
  - Maximum Entropy

Extracting Feature Vectors
- Words as features
  - Remove uninformative tokens such as "www", "index", "html", etc.
  - For example, a URL is split into the tokens: internetwordstats, com, africa
  - cnn, gov are indicative of English
  - produits, recherche are indicative of French
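
The word-feature step above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `url_tokens`, the example URL, and the stop list beyond "www", "index", and "html" are assumptions, as is the choice to split on every non-letter character.

```python
import re

# Assumption: the slide names only "www", "index", "html"; the rest are guesses.
STOP_TOKENS = {"www", "index", "html", "htm", "http", "https"}


def url_tokens(url: str) -> list[str]:
    """Split a URL into lowercase, letter-only tokens and drop stop tokens."""
    parts = re.split(r"[^a-z]+", url.lower())
    return [p for p in parts if p and p not in STOP_TOKENS]


# Hypothetical URL, used only to show the splitting behaviour.
print(url_tokens("http://www.example.com/africa/index.html"))
# → ['example', 'com', 'africa']
```

Each remaining token then becomes one word feature of the URL.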

- Trigrams as features
  - Start with the same tokens as the method above (words as features)
  - E.g., weather: "_we", "wea", "eat", "ath", "the", "her", "er_"
  - "_th", "ing" are very common in English
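
The trigram example on this slide can be reproduced with a few lines. This is a sketch under the assumption that each token is padded with one underscore on each side before sliding a window of three characters over it; the function name `trigrams` is illustrative.

```python
def trigrams(token: str) -> list[str]:
    """Return the character trigrams of a token padded with '_' on both sides."""
    padded = f"_{token}_"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


print(trigrams("weather"))
# → ['_we', 'wea', 'eat', 'ath', 'the', 'her', 'er_']
```

The padding lets the features distinguish word-initial and word-final character sequences, such as "_th" at the start of a token.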

- Custom-made features
  - Top-level domain country code
  - OpenOffice dictionaries
  - Dictionary with city names
  - Number of hyphens
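
One way the four custom-made features might be computed is sketched below. Everything concrete here is an assumption made for illustration: the tiny word sets stand in for the OpenOffice dictionaries and the city-name dictionary, the country-code mapping is a guess, and the feature names are hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical stand-ins for the real resources named on the slide.
CCTLD_LANGUAGE = {"de": "German", "fr": "French", "es": "Spanish",
                  "it": "Italian", "uk": "English"}
DICTIONARY = {"French": {"produits", "recherche"}, "English": {"weather", "news"}}
CITIES = {"German": {"berlin"}, "Italian": {"roma"}}


def custom_features(url: str, tokens: list[str]) -> dict:
    """Build a small feature dictionary from a URL and its word tokens."""
    host = urlparse(url).hostname or ""
    tld = host.rsplit(".", 1)[-1]
    features = {
        "hyphens": url.count("-"),                  # number of hyphens in the URL
        "cctld_language": CCTLD_LANGUAGE.get(tld),  # None for .com, .org, ...
    }
    for lang, words in DICTIONARY.items():
        features[f"dict_{lang}"] = sum(t in words for t in tokens)
    for lang, names in CITIES.items():
        features[f"city_{lang}"] = sum(t in names for t in tokens)
    return features


feats = custom_features("http://www.produits-recherche.fr/berlin",
                        ["produits", "recherche", "fr", "berlin"])
print(feats)
```

The counts per language can be fed to a classifier alongside the word or trigram features.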

Classification Algorithms
- Country code top-level domain only (ccTLD)
- Country code top-level domain plus (ccTLD+)
- Naïve Bayes (NB)
- Decision Trees (DT)
- Relative Entropy (RE)
- Maximum Entropy (ME)
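
To make one of the listed learners concrete, here is a minimal sketch of a Naïve Bayes classifier over padded character trigrams. It is a generic multinomial Naïve Bayes with add-one smoothing, which is an assumption and not necessarily the variant used in the paper; the class name, the two training examples, and the test tokens are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def trigrams(token: str) -> list[str]:
    padded = f"_{token}_"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


class TrigramNaiveBayes:
    def __init__(self):
        self.counts = defaultdict(Counter)  # language -> trigram counts
        self.docs = Counter()               # language -> number of training URLs
        self.vocab = set()                  # all trigrams seen in training

    def fit(self, examples):
        """examples: iterable of (list of URL tokens, language label)."""
        for tokens, lang in examples:
            self.docs[lang] += 1
            for tok in tokens:
                for g in trigrams(tok):
                    self.counts[lang][g] += 1
                    self.vocab.add(g)
        return self

    def predict(self, tokens):
        """Return the language with the highest smoothed log-probability."""
        grams = [g for tok in tokens for g in trigrams(tok)]
        total_docs = sum(self.docs.values())
        best, best_score = None, -math.inf
        for lang in self.docs:
            total = sum(self.counts[lang].values())
            score = math.log(self.docs[lang] / total_docs)  # class prior
            for g in grams:
                # Add-one (Laplace) smoothing over the training vocabulary.
                score += math.log((self.counts[lang][g] + 1) / (total + len(self.vocab)))
            if score > best_score:
                best, best_score = lang, score
        return best


model = TrigramNaiveBayes().fit([
    (["weather", "news"], "English"),
    (["produits", "recherche"], "French"),
])
print(model.predict(["weathering"]), model.predict(["recherches"]))
```

With only two toy training URLs this says nothing about real accuracy; it only shows how trigram counts per language turn into a prediction.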

Data Set
- The algorithms were evaluated on three different data sets
  - Open Directory Project
  - Microsoft's Live Search
  - 1,260 pages from a large web crawl, labeled by hand

Data set                Language  Training size  Test size
Open Directory Project  English   145,
                        German    144,
                        French    144,
                        Spanish   144,
                        Italian   144,
Search Engine Results   English   99,
                        German    99,
                        French    99,
                        Spanish   99,
                        Italian   99,
Web Crawl               English   0              1082
                        German    0              81
                        French    0              57
                        Spanish   0              19
                        Italian   0              21

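Evaluation measures, where $n_+$ and $n_-$ are the numbers of positive and negative test examples:
- Precision: $P = \frac{n_+ \, p(+|+)}{n_+ \, p(+|+) + n_- \, (1 - p(-|-))}$
- Recall: $R = p(+|+)$
- Fraction of negative examples classified correctly: $p(-|-)$
- F-measure: $F = \frac{2}{1/R + 1/P}$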
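
The measures on this slide can be computed directly from the two per-class success rates. The sketch below assumes n+ and n− are the numbers of positive and negative test examples, p(+|+) is the fraction of positives classified as positive, and p(−|−) is the fraction of negatives classified as negative; the function name and the sample numbers are made up for illustration.

```python
def precision_recall_f(n_pos: int, n_neg: int, p_pos: float, p_neg: float):
    """Return (P, R, F) from class sizes and per-class success rates.

    Assumes p_pos > 0, otherwise precision and F are undefined.
    """
    true_pos = n_pos * p_pos            # n+ * p(+|+)
    false_pos = n_neg * (1.0 - p_neg)   # n- * (1 - p(-|-))
    precision = true_pos / (true_pos + false_pos)
    recall = p_pos
    f_measure = 2.0 / (1.0 / recall + 1.0 / precision)
    return precision, recall, f_measure


# Made-up numbers: 100 positives, 400 negatives.
p, r, f = precision_recall_f(100, 400, p_pos=0.9, p_neg=0.95)
print(round(p, 3), round(r, 3), round(f, 3))
```

Because precision depends on n−, the same classifier has lower precision on a test set where the target language is rare, even if p(+|+) and p(−|−) are unchanged.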

Human Performance

Baseline: ccTLD

Conclusions
- This paper shows that high-quality language identifiers for web pages can be built based on URLs alone.
- The largest challenge is to identify English-looking URLs of non-English web pages.