N-Gram-based Dynamic Web Page Defacement Validation Woonyon Kim Aug. 23, 2004 NSRI, Korea.

Presentation transcript:


Contents
- Introduction
- Related Works
- N-Gram Frequency Index
- N-Gram-based Index Distance
- Experiments
- Conclusions

Introduction
Defacement of Web Sites
- CSI/FBI 2001
  - 38% of web sites were hacked.
  - 21% of hacked sites were not aware of their own defacements.
- Zone-h
  - The number of defaced web pages is increasing rapidly year by year (.kr domain: about 200% increase).
Current solutions
- Hash-based detection system for minimizing damage
- Intrusion-tolerant system for continuous service
Problems of current solutions
- Current solutions use a hash code as the validation metric, and a hash code cannot accommodate dynamically changing content.

Introduction
N-Gram-based Index Distance (NGID)
- A validation metric for dynamically changing web pages
- The sum of the absolute differences between the frequency probabilities of the N-Grams that are found in both indexes
- NGID represents the similarity of two web pages: the smaller the value, the more similar the pages.
- NGID can be used to validate web pages with either dynamic or static components.
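The definition above can be written compactly as follows. This is only one reading of the slide text; the symbols $I_A$, $I_B$ (the two truncated indexes), $f_X(g)$ (the frequency of N-Gram $g$ in index $I_X$) and $p_X(g)$ (its frequency probability) are notation assumed here, not taken from the presentation.

```latex
\[
  p_X(g) = \frac{f_X(g)}{\sum_{h \in I_X} f_X(h)},
  \qquad
  \mathrm{NGID}(A, B) = \sum_{g \,\in\, I_A \cap I_B} \bigl|\, p_A(g) - p_B(g) \,\bigr|
\]
```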

Related Works
Hash-based validation system
- Detects web page defacement by comparing two hash codes
- A hash code is a useful metric for large, static web pages.
- A hash code does not work properly on dynamically changing web pages.
Intrusion-tolerant system
- A hash code is used to validate web pages.
- It has the same limitation on dynamic web pages.
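A minimal sketch of why hash comparison breaks down on dynamic pages. The two HTML strings and the helper names `page_digest` and `is_valid_by_hash` are hypothetical, chosen only for illustration; MD5 is used because the experiments in this presentation report MD5 results.

```python
import hashlib


def page_digest(html: str) -> str:
    """Return the MD5 hex digest of a page's content."""
    return hashlib.md5(html.encode("utf-8")).hexdigest()


def is_valid_by_hash(baseline_html: str, current_html: str) -> bool:
    """Hash-based validation: the page is valid only if the digests match exactly."""
    return page_digest(baseline_html) == page_digest(current_html)


# Hypothetical pages: the same legitimate page, differing in one dynamic counter.
baseline = "<html><body><h1>News</h1><p>Visitors today: 1024</p></body></html>"
current = "<html><body><h1>News</h1><p>Visitors today: 1025</p></body></html>"

print(is_valid_by_hash(baseline, baseline))  # identical content passes
print(is_valid_by_hash(baseline, current))   # a one-character change fails
```

Any change to the content, however small, produces a different digest, so a legitimately updated page is reported as defaced.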

N-Gram Frequency Index (1)
N-Gram
- An N-character slice of a string
- For example, for "TEXT":
  - 2-Gram: TE, EX, XT
N-Gram Frequency Index
- An index file sorted from the most frequent N-Grams to the least frequent ones
- N-Grams below a particular rank are cut off, so minor changes are ignored. This is how the N-Gram Frequency Index accommodates dynamic content.
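The "TEXT" example can be reproduced with a sliding window over the string. The function name `ngrams` is an assumption made for this sketch.

```python
from typing import List


def ngrams(text: str, n: int) -> List[str]:
    """Return every N-character slice of text, in order of appearance."""
    # A string shorter than n yields no slices (the range is empty).
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(ngrams("TEXT", 2))  # ['TE', 'EX', 'XT']
```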

N-Gram Frequency Index (2)
How to generate
1. Count the frequencies of all N-Grams in a web page.
2. Sort the N-Grams from the most frequent to the least frequent.
3. Cut off the N-Grams below a particular rank.
4. Sum up the frequencies of the remaining N-Grams.
5. Compute the probability of each N-Gram from its frequency.
6. Save the N-Grams, their frequencies, and their probabilities into an index file.
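The generation steps can be sketched as below. This is an illustration under assumptions, not the authors' implementation: the function name `build_index`, the default `n=2`, and the cutoff rank `top_k=100` are hypothetical, and the index is returned as a list rather than written to a file.

```python
from collections import Counter
from typing import List, Tuple


def build_index(text: str, n: int = 2, top_k: int = 100) -> List[Tuple[str, int, float]]:
    """Build a truncated N-Gram frequency index as (ngram, frequency, probability) rows."""
    # Step 1: count every N-Gram in the page.
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    # Steps 2-3: sort from most to least frequent and keep only the top_k ranks.
    ranked = counts.most_common(top_k)
    # Step 4: sum the frequencies of the remaining N-Grams.
    total = sum(freq for _, freq in ranked)
    # Step 5: probability of each N-Gram relative to the remaining total.
    return [(gram, freq, freq / total) for gram, freq in ranked]


for row in build_index("ABABAB", n=2, top_k=2):
    print(row)
```

Because the probabilities are computed after the cutoff, they sum to 1 over the retained N-Grams, which matches the order of the steps on the slide.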

N-Gram-based Index Distance (NGID)
- The sum of the absolute differences between the frequency probabilities of the same N-Grams found in both web pages
- A metric for detecting whether a web page has been defaced
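A minimal sketch of the distance, following the slide wording literally: only N-Grams present in both indexes are compared. The function name `ngid` and the representation of an index as a dict from N-Gram to probability are assumptions for this example.

```python
from typing import Dict


def ngid(index_a: Dict[str, float], index_b: Dict[str, float]) -> float:
    """Sum of absolute probability differences over N-Grams found in both indexes."""
    shared = index_a.keys() & index_b.keys()
    return sum(abs(index_a[g] - index_b[g]) for g in shared)


# Hypothetical indexes: N-Gram -> frequency probability.
normal = {"AB": 0.6, "BA": 0.4}
fetched = {"AB": 0.5, "BA": 0.3, "CC": 0.2}

print(ngid(normal, normal))   # 0 for identical indexes
print(ngid(normal, fetched))  # approximately 0.2
```

Under this literal reading, an N-Gram that appears in only one index ("CC" here) contributes nothing directly; it only affects the result through the probabilities of the shared N-Grams. The slides do not say whether the original system treats such N-Grams differently.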

N-Gram-based Index Distance
Evaluation is done by comparing the NGID to a validation threshold.
- Valid: NGID <= Validation Threshold
- Invalid: NGID > Validation Threshold
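The evaluation rule is a single comparison. In this sketch the function name `evaluate` is hypothetical, and the default threshold of 0.1 is the value chosen in the experiments of this presentation.

```python
def evaluate(ngid_value: float, threshold: float = 0.1) -> str:
    """Classify a page as 'Valid' or 'Invalid' by comparing its NGID to the threshold."""
    # The boundary case (NGID equal to the threshold) counts as valid.
    return "Valid" if ngid_value <= threshold else "Invalid"


print(evaluate(0.05))  # Valid
print(evaluate(0.1))   # Valid
print(evaluate(0.25))  # Invalid
```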

Experiments
Assumptions
- Select 100 web pages.
- Use 0.1 as the Validation Threshold of NGID.
Procedure for false positive
1. Connect to one selected web page at a time from a remote location.
2. Download the page and save it to a file.
3. Validate it using NGID.
4. Validate it using the hash code.
- The four steps above are repeated every 30 minutes throughout a day.
[Table: selected web sites by category (News Paper, Broadcast, Portal, Public, Total)]

Experiments
False Positive

                                News Paper  Broadcast  Portal  Public  Total
No. of Web Sites
No. of False Positive (MD5)
No. of False Positive (NGID)    1           1          0       0       2

Experiments False Positive

Experiments
[Figure: NGID value as time passes, with the times of content updates marked (1, 2)]

Experiments
Procedure for false negative
1. Collect 50 web pages, consisting of normal pages and hacked pages from zone-h.
2. Validate them using NGID.
3. Validate them using the hash code.
Result of hash code
- 50 web pages are detected as defaced.
- The number of false negatives is 0.

Experiments False Negative

Conclusions
N-Gram-based Index Distance
- A metric for evaluating the defacement of dynamic web pages
- NGID can validate dynamically changing web pages.
Future Works
- A learning model is needed to determine the validation threshold for each web page.
- A feedback mechanism is needed for the normal index.