
“An Approach to Identify Duplicated Web Pages” G. Lucca, M. Penta, A. Fasolino Compsac’02 pp.481-486 Today presented by Kenny Kwok.


1 “An Approach to Identify Duplicated Web Pages” G. Lucca, M. Penta, A. Fasolino Compsac’02 pp.481-486 Today presented by Kenny Kwok

2 Why do we need this? Web pages are loosely organized. They are usually coded incrementally. New pages are often written by reusing code from existing pages (copy & paste). Inline documentation is usually lacking.

3 Why do we need this? Techniques to identify duplicated web pages make testing feasible, make web page maintenance more efficient, and make it possible to detect plagiarism. Duplicated code => clones. Two or more pages are considered clones if they have the same, or a very similar, structure, or if they are characterized by the same values of the defined metrics.

4 Types of Web Pages. Server pages: stored on the web server; may contain server-side scripts. Client pages: static pages, saved in a file with permanent content; dynamic pages, built by the server at run time. The paper covers only static pages and server-side scripts. Since the results on server-side scripts are not conclusive, we discuss only static pages.

5 How to detect duplicated Web Pages? Two proposed approaches: Levenshtein distance (edit distance) and occurrence frequency.

6 Levenshtein distance. A.k.a. edit distance. The minimal transformation distance between two strings, i.e. the minimum number of single-symbol insertions, deletions, and substitutions needed to turn one string into the other. Requires O(n^2) computation time, where n is the size of the longer string. For example, for the strings u = ABCDEFG and v = ADEG (aligned as A--DE-G), the Levenshtein distance is D(u, v) = 3.
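As an illustration of the definition on this slide, here is a minimal sketch of the standard dynamic-programming computation of the Levenshtein distance. It assumes unit cost for every insertion, deletion, and substitution; the function name `levenshtein` is ours, not the paper's.

```python
def levenshtein(u: str, v: str) -> int:
    """Minimum number of single-symbol edits (insert, delete, substitute) turning u into v."""
    # prev[j] = distance between the symbols of u processed so far (minus one) and v[:j].
    prev = list(range(len(v) + 1))
    for i, cu in enumerate(u, start=1):
        curr = [i]  # distance from u[:i] to the empty string is i deletions
        for j, cv in enumerate(v, start=1):
            cost = 0 if cu == cv else 1
            curr.append(min(prev[j] + 1,          # delete cu
                            curr[j - 1] + 1,      # insert cv
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


print(levenshtein("ABCDEFG", "ADEG"))  # 3
```

The two nested loops visit every pair of positions once, which is where the quadratic running time comes from.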

7 Levenshtein distance of Web Pages. Alphabet symbols: HTML tags (/div, /td, td, img, div, etc.). Extract the tags and replace each with a symbol of the alphabet (e.g. /div -> a, /td -> b, …). Translate the web page into an “HTML-string” composed of those symbols. The Levenshtein distance between two pages is then the distance between their corresponding HTML-strings.
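The translation step can be sketched with Python's standard `html.parser` module. The alphabet below is hypothetical: only `/div -> a` and `/td -> b` come from the slide, the other entries are invented for illustration, and skipping tags that are not in the alphabet is our assumption rather than something the paper states.

```python
from html.parser import HTMLParser

# Hypothetical alphabet table; the paper's actual table is not reproduced in the slides.
ALPHABET = {"/div": "a", "/td": "b", "td": "c", "img": "d", "div": "e"}


class TagExtractor(HTMLParser):
    """Collects one alphabet symbol per opening or closing tag, in document order."""

    def __init__(self, alphabet):
        super().__init__()
        self.alphabet = alphabet
        self.symbols = []

    def handle_starttag(self, tag, attrs):
        self._emit(tag)  # attributes are ignored in this sketch

    def handle_endtag(self, tag):
        self._emit("/" + tag)

    def _emit(self, name):
        # Assumption: tags missing from the alphabet are dropped.
        if name in self.alphabet:
            self.symbols.append(self.alphabet[name])


def html_string(page, alphabet=ALPHABET):
    """Translate an HTML page into its HTML-string over the given alphabet."""
    parser = TagExtractor(alphabet)
    parser.feed(page)
    parser.close()
    return "".join(parser.symbols)


page = "<div><table><tr><td><img src='x.png'></td></tr></table></div>"
print(html_string(page))  # ecdba
```

Two pages can then be compared by computing the edit distance between their HTML-strings.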

8 Levenshtein distance (example). With the HTML alphabet table on the slide (table not reproduced here), the two pages translate to HTML-string u = hifgieb and HTML-string v = hidcfgieab.

9 Levenshtein distance (example). The optimal alignment of u and v pairs hi--fgie-b with hidcfgieab (three insertions: d, c, a), so the Levenshtein distance is D(u, v) = 3. Pages are considered duplicated (similar) if their distance is small. However, the paper does not define quantitatively what “small” means.

10 Problems and possible improvements. The method may detect misleading similarities, due to the sequence of HTML attributes: a false positive, where different pages have a small distance value. Suggestion: substitute each composite tag in alphabet A with its equivalent tag in a new alphabet A’. However, the paper gives no further detail about the alphabet A’.

11 Problems and possible improvements. The method may fail to detect meaningful similarities, due to different tags of a similar nature, e.g. formatting tags (H1, H2, H3). Suggestion: define an alphabet A’’ of formatting tags and eliminate from the HTML-strings the symbols that belong to A’’. Again, the paper gives no further detail about the alphabet A’’.

12 Occurrence frequency. Makes use of an HTML-array (the occurrence count of each tag in the page). Compare pages by the Euclidean distance between their HTML-arrays; for the example, ED(u, v) = 1.732. Much faster to compute. May identify all the clones found by the previous method. More likely to report false-positive clones. The paper, again, does not describe the clone criterion in terms of the ED value; there is no clue of how “small” it should be.
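A minimal sketch of the frequency-based comparison, assuming the HTML-array is simply the count of each symbol in the HTML-string (the function name `euclidean_distance` is ours). Applied to the example strings u = hifgieb and v = hidcfgieab from the Levenshtein example, it yields sqrt(3), which appears to be the 1.732 quoted on this slide.

```python
from collections import Counter
from math import sqrt


def euclidean_distance(u: str, v: str) -> float:
    """Euclidean distance between the symbol-frequency arrays of two HTML-strings."""
    fu, fv = Counter(u), Counter(v)
    # A Counter returns 0 for symbols it has not seen, so absent tags count as zero.
    return sqrt(sum((fu[s] - fv[s]) ** 2 for s in fu.keys() | fv.keys()))


print(round(euclidean_distance("hifgieb", "hidcfgieab"), 3))  # 1.732
```

Counting symbols takes time linear in the string length, which explains the speed advantage; it also ignores tag order, so pages with the same tags in a different arrangement get distance 0, which is one source of false positives.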

13 Experiment Result. Levenshtein: accurate; slow. Frequency measure: introduces false positives; much faster. Suggestion: use the frequency measure to extract candidates, then use the Levenshtein distance to verify the result.
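The suggested combination can be sketched as a two-stage filter. Everything here is illustrative: the function `find_clones`, the page names, and both thresholds are hypothetical, since the paper does not say how small a distance must be for two pages to count as clones.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def euclidean_distance(u, v):
    """Cheap measure: distance between symbol-frequency arrays."""
    fu, fv = Counter(u), Counter(v)
    return sqrt(sum((fu[s] - fv[s]) ** 2 for s in fu.keys() | fv.keys()))


def levenshtein(u, v):
    """Expensive measure: unit-cost edit distance via dynamic programming."""
    prev = list(range(len(v) + 1))
    for i, cu in enumerate(u, start=1):
        curr = [i]
        for j, cv in enumerate(v, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (cu != cv)))
        prev = curr
    return prev[-1]


def find_clones(pages, ed_threshold, ld_threshold):
    """Return pairs of page names that pass both the frequency and the edit-distance test."""
    clones = []
    for (name_a, a), (name_b, b) in combinations(pages.items(), 2):
        # Stage 1: fast frequency filter discards clearly different pages.
        if euclidean_distance(a, b) > ed_threshold:
            continue
        # Stage 2: the slower Levenshtein check weeds out false positives.
        if levenshtein(a, b) <= ld_threshold:
            clones.append((name_a, name_b))
    return clones


# Hypothetical HTML-strings and thresholds, for illustration only.
pages = {"p1": "hifgieb", "p2": "hidcfgieab", "p3": "aaaabbbbcccc"}
print(find_clones(pages, ed_threshold=2.0, ld_threshold=3))  # [('p1', 'p2')]
```

Whether the first stage misses any true clones depends on how the two thresholds relate, so they would need to be tuned together.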

14 Conclusion. Two web page clone detection methods are proposed and evaluated. Each has its strengths and weaknesses, but they can be combined into a refinement process. Clone detection techniques are useful to: identify cases of plagiarism; highlight reuse of patterns of HTML tags; facilitate Web maintenance; facilitate the testing process of web applications.

15 Final Note. The paper does not give the translation alphabet table or explain how to obtain it correctly. The paper does not state the distance similarity criteria used in the experiment. The experiment does not cover the detection of plagiarism, although it may be possible.

16 Q&A Thank You

