Thumbnail Summarization Techniques For Web Archives, by Ahmed AlSum and Michael L. Nelson


1 Thumbnail Summarization Techniques For Web Archives. Ahmed AlSum*, Stanford University Libraries, Stanford CA, USA, aalsum@stanford.edu. Michael L. Nelson, Old Dominion University, Norfolk VA, USA, mln@cs.odu.edu. The 36th European Conference on Information Retrieval (ECIR 2014), Amsterdam, Netherlands, 2014. * Ahmed AlSum did this work while he was a PhD student at Old Dominion University.

2 What is a Web Archive? http://www.cs.odu.edu

3 Thumbnails in Web Archives: Internet Archive, UK Web Archive

4 Memento Terminology. Original Resource (URI-R, R): http://www.amazon.com. Memento (URI-M, M): http://web.archive.org/web/20110411070244/http://amazon.com. TimeMap (URI-T, TM).

5 Thumbnail Creation Challenges. Scalability in time: the Internet Archive (IA) may need 361 years to create a thumbnail for each memento using one hundred machines. Scalability in space: IA will need 355 TB to store one thumbnail per memento. Page quality.

6 Thumbnail Usage Challenges. This is a partial view of the first 700 thumbnails out of 10,500 available mementos for www.apple.com.

7 From 10,500 Mementos to 69 Thumbnails.

8 How many thumbnails do we need? www.unfi.com on the live Web

9 How many thumbnails do we need? www.unfi.com on the live Web

10 40 Thumbnails are good.

11 METHODOLOGY

12 Visual Similarity and Text Similarity. [Figure labels: Similar, Different, HTML Text]

13 Correlation between Visual Similarity and Text Similarity. Text similarity measures: SimHash, DOM tree, embedded resources, Memento Datetime (capture time). Each is correlated with visual similarity.

14 Text Similarity: SimHash. Compute 64-bit SimHash fingerprints with k = 4 for two pages. Candidate inputs: the full HTML text (✔); the main content of the web page; all the text; the template including the text; the template excluding the text. Calculate the difference using Hamming distance.
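The SimHash comparison on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the word tokenization, the MD5-based token hash, and the function names `simhash64` and `hamming_distance` are all assumptions, and the k = 4 parameter from the slide is not modelled.

```python
import hashlib
import re


def simhash64(text: str) -> int:
    """Return a 64-bit SimHash of the text (tokenization is an assumption)."""
    weights = [0] * 64
    for token in re.findall(r"\w+", text.lower()):
        # Use the first 8 bytes of MD5 as a stable 64-bit hash of the token.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")
        for bit in range(64):
            weights[bit] += 1 if (token_hash >> bit) & 1 else -1
    fingerprint = 0
    for bit in range(64):
        if weights[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Two captures with identical HTML text yield identical fingerprints and therefore a Hamming distance of 0; the distance grows as the text diverges.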

15 Text Similarity: DOM Tree. Transform each web page into a DOM tree. Calculate the difference using Levenshtein distance, that is, the number of insert, update, and delete operations needed to turn one into the other.
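As a rough illustration of the edit-distance idea, the sketch below flattens each page into its sequence of opening tags and computes the Levenshtein distance between the two sequences. Flattening the tree is a simplifying assumption made here for brevity; a true tree edit distance over the DOM is more involved. The names `TagCollector`, `tag_sequence`, and `levenshtein` are hypothetical.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collect opening tag names in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html: str) -> list:
    """Flatten a page into its list of opening tags (a simplification)."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def levenshtein(a, b) -> int:
    """Minimum number of insert, update, and delete operations from a to b."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        current = [i]
        for j, y in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # update x to y (free if equal)
            ))
        previous = current
    return previous[-1]
```

For example, a capture that gains one `div` wrapper relative to the previous capture is at distance 1 from it under this measure.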

16 Text Similarity: Embedded resources. Extract the embedded resources for each page. Calculate the total number of resources that have been added and resources that have been removed. For example, the difference between M1 and M2: addition of 5 resources (2 JavaScript files and 3 images) and removal of 2 resources (1 JavaScript file and 1 image).
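The embedded-resource measure amounts to a set difference in both directions. The sketch below assumes the resource URLs have already been extracted from each memento; the function name `resource_difference` and the file names in the example are hypothetical.

```python
def resource_difference(old_resources, new_resources):
    """Return (added, removed, total) counts between two resource lists."""
    old_set, new_set = set(old_resources), set(new_resources)
    added = len(new_set - old_set)
    removed = len(old_set - new_set)
    return added, removed, added + removed


# Hypothetical resources chosen to mirror the slide's example:
# 5 additions (2 scripts, 3 images) and 2 removals (1 script, 1 image).
m1 = ["style.css", "old.js", "old.png"]
m2 = ["style.css", "a.js", "b.js", "1.png", "2.png", "3.png"]
added, removed, total = resource_difference(m1, m2)
```

With these inputs the difference is 5 added plus 2 removed, for a total of 7.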

17 Text Similarity: Memento Datetime. Calculate the difference, in seconds, between the capture times recorded for the two pages.
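A small sketch of the capture-time difference, assuming the capture time is read from the 14-digit timestamp in a Wayback-style URI-M such as the one shown on the terminology slide. Reading the time from the URI rather than from an HTTP header is an assumption made for brevity, and the helper names are hypothetical.

```python
import re
from datetime import datetime, timezone


def memento_datetime(uri_m: str) -> datetime:
    """Parse the 14-digit YYYYMMDDhhmmss timestamp from a Wayback-style URI-M."""
    match = re.search(r"/web/(\d{14})", uri_m)
    if match is None:
        raise ValueError("no 14-digit timestamp found in " + uri_m)
    parsed = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    return parsed.replace(tzinfo=timezone.utc)


def capture_gap_seconds(uri_m_a: str, uri_m_b: str) -> float:
    """Absolute difference between two capture times, in seconds."""
    delta = memento_datetime(uri_m_b) - memento_datetime(uri_m_a)
    return abs(delta.total_seconds())
```

Two captures taken exactly one hour apart would give a difference of 3600 seconds.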

18 Visual Similarity. Measurement: the number of different pixels between two thumbnails. To compare two thumbnails, resize them to several dimensions (64x64, 128x128, 256x256, and 600x600), then calculate the Manhattan distance and the Zero distance between each pair.
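The two pixel distances can be sketched as below. The sketch assumes both thumbnails have already been resized to the same dimensions and loaded as rows of grayscale integers; the resizing step is omitted. Interpreting "Zero distance" as the count of pixel positions that differ is an assumption based on the slide's stated measurement, and the function names are hypothetical.

```python
def _check_same_shape(img_a, img_b):
    """Raise if the two images do not have identical dimensions."""
    if len(img_a) != len(img_b) or any(
        len(row_a) != len(row_b) for row_a, row_b in zip(img_a, img_b)
    ):
        raise ValueError("thumbnails must have identical dimensions")


def manhattan_distance(img_a, img_b) -> int:
    """Sum of absolute per-pixel intensity differences."""
    _check_same_shape(img_a, img_b)
    return sum(
        abs(pa - pb)
        for row_a, row_b in zip(img_a, img_b)
        for pa, pb in zip(row_a, row_b)
    )


def zero_distance(img_a, img_b) -> int:
    """Number of pixel positions whose values differ (assumed meaning)."""
    _check_same_shape(img_a, img_b)
    return sum(
        pa != pb
        for row_a, row_b in zip(img_a, img_b)
        for pa, pb in zip(row_a, row_b)
    )
```

The Manhattan distance is sensitive to how much each pixel changed, while the count of differing pixels only records that a change occurred.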

19 Correlation between Visual Similarity and Text Similarity. [Figure panels: SimHash, DOM tree, Embedded resources, Memento Datetime] References: SimHash [Charikar 2002], DOM tree [Pawlik 2011], Memento Datetime [Van de Sompel 2013]

20 SELECTION ALGORITHMS

21 Threshold Grouping

22 Threshold Grouping
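The Threshold Grouping slides carry only figures, so the following is one plausible reading rather than the authors' algorithm: walk the mementos in capture order and select a new thumbnail whenever the SimHash fingerprint differs from the last selected one by more than a threshold. The function name, the use of Hamming distance, and the threshold value in the example are all assumptions.

```python
def threshold_grouping(fingerprints, threshold):
    """Return indices of mementos that start a new group (assumed reading).

    fingerprints: SimHash integers in capture order.
    threshold: maximum Hamming distance tolerated within a group.
    """
    selected = []
    last_fingerprint = None
    for index, fingerprint in enumerate(fingerprints):
        if (
            last_fingerprint is None
            or bin(fingerprint ^ last_fingerprint).count("1") > threshold
        ):
            selected.append(index)
            last_fingerprint = fingerprint
    return selected
```

Because each memento is compared only with the most recently selected one, this reading processes new captures as they arrive, which would fit the "Incremental: Yes" entry in the comparison slide.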

23 Clustering technique. Input: a TimeMap with n mementos and a set of features, for example F = {SimHash, Memento-Datetime}. Task: cluster the n mementos into K clusters.

24 Clustering technique. [Figure panels: SimHash feature; SimHash and Datetime features] Reference: Park, H.-S., & Jun, C.-H. (2009). A simple and fast algorithm for K-medoids clustering. Expert Systems with Applications, 36(2, Part 2), 3336–3341.
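A compact K-medoids sketch over SimHash fingerprints may help make the clustering task concrete. It alternates between assigning each memento to its nearest medoid and re-picking each cluster's medoid, but it uses a naive evenly spaced initialization and does not reproduce the initialization step of the cited Park and Jun algorithm. The names `hamming` and `k_medoids`, and the single-feature distance, are assumptions.

```python
def hamming(a: int, b: int) -> int:
    """Hamming distance between two integer fingerprints."""
    return bin(a ^ b).count("1")


def k_medoids(items, k, distance, max_iter=100):
    """Return sorted indices of k medoids (naive initialization, assumed)."""
    n = len(items)
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and len(items)")
    medoids = [i * n // k for i in range(k)]  # evenly spaced starting points
    for _ in range(max_iter):
        clusters = {m: [] for m in medoids}
        for i in range(n):
            if i in clusters:
                nearest = i  # a medoid always stays in its own cluster
            else:
                nearest = min(
                    medoids, key=lambda m: distance(items[i], items[m])
                )
            clusters[nearest].append(i)
        # New medoid: the member with the smallest total distance to the rest.
        new_medoids = [
            min(
                members,
                key=lambda c: sum(distance(items[c], items[o]) for o in members),
            )
            for members in clusters.values()
        ]
        if sorted(new_medoids) == sorted(medoids):
            break
        medoids = new_medoids
    return sorted(medoids)
```

The returned medoid indices identify the K mementos whose thumbnails would represent their clusters.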

25 Time Normalization

26 Selection Algorithms Comparison

                         Threshold Grouping   K clustering   Time Normalization
TimeMap Reduction        27%                  9% to 12%      23%
Image Loss               28                   78 - 101       109
# Features               1 feature            1 or more      1 feature
Preprocessing required   Yes                  (unclear)      No
Efficient processing     Medium               Extensive      Light
Incremental              Yes                  No             Yes
Online/offline           Both                 Both           Both

(The transcript gives only "Yes No" for the three Preprocessing columns, so the K clustering value is not recoverable.)

27 Generalization outside the Web Archive. Get k thumbnails from a website that has n pages.

28 Conclusions. We explored the similarity between the text and the visual appearance of web pages. We found that SimHash and Levenshtein distance have the highest correlation with visual similarity. We presented three algorithms to select k thumbnails from n mementos per TimeMap. aalsum@stanford.edu @aalsum

