TEMPORAL SPREAD IN ARCHIVED COMPOSITE RESOURCES (WORK IN PROGRESS) SCOTT G. AINSWORTH MICHAEL L. NELSON OLD DOMINION UNIVERSITY COMPUTER SCIENCE WADL 2013.


1 TEMPORAL SPREAD IN ARCHIVED COMPOSITE RESOURCES (WORK IN PROGRESS) SCOTT G. AINSWORTH MICHAEL L. NELSON OLD DOMINION UNIVERSITY COMPUTER SCIENCE WADL 2013 JULY 25 – 26, 2013 INDIANAPOLIS, INDIANA USA

2 Joint Conference on Digital Libraries (JCDL) 2013 CONTENTS
- Motivation
- Related work
- Preliminary work
- Temporal Spread
- Future work
- Conclusion
7/26/13 Scott G. Ainsworth Michael L. Nelson

3 A FABLE FROM WAYBACK

4 TEMPORAL SPREAD
[Figure labels: 2005-05-14 01:36:08; +9 days; +18 days; +7 months; +2.1 years]

5 QUESTIONS
- How much temporal spread exists in composite mementos?
- How can temporal spread be minimized?
- What factors contribute, positively or negatively, to spread?
- Does combining multiple archives produce better results?
- Would users with differing goals benefit from different minimization policies and heuristics?
- How can temporal coherence be displayed to users simply?

6 CONTENTS
- Motivation
- Related work
- Preliminary work
- Temporal Spread
- Future work
- Conclusion

7 RELATED WORK
Control Crawl Data Quality, Future collections
- Spaniol et al. – crawling strategy
- Denev et al. – change rates by MIME type and depth
- Ben Saad et al. – metadata from crawl used to select best results from archive
Our Focus: Existing Data Quality
- Existing collections
- Datetime selection policies

8 RELATED WORK
Use Patterns
- AlNoamany et al. – Archive Access Patterns
- Humans vs. robots
- Dip, dive, slide, & skim
Identifying Duplicates
- Simple identity (images, other binary formats): direct comparison, hash comparison
- HTML, CSS (text): shingling, Jaccard distances, etc.
- SimHash shows the most promise
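The shingling and Jaccard comparison named on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the word-level tokenization, and the shingle size `k=4` are all assumptions made for the example.

```python
def shingles(text, k=4):
    """Return the set of k-word shingles (contiguous word windows) in text."""
    words = text.split()
    if not words:
        return set()
    if len(words) < k:
        # Text shorter than one window: treat the whole text as one shingle.
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard_similarity(a, b, k=4):
    """Jaccard similarity of the shingle sets of two texts (1.0 = identical sets)."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # two empty documents are considered identical
    return len(sa & sb) / len(sa | sb)


# Hypothetical example texts: they differ only in the last word.
similarity = jaccard_similarity("a b c d e", "a b c d f", k=2)
```

A similarity near 1.0 suggests two text mementos are near-duplicates even when they are not byte-for-byte equal; the Jaccard distance is one minus this value.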

9 RELATED WORK – MEMENTO*
HTTP extension for datetime negotiation
Request:
    GET /http://www.cs.odu.edu/ HTTP/1.1
    …
    Accept-Datetime: Sat, 10 May 2005 11:21:00 GMT
    …
Response:
    HTTP/1.1 200 OK
    …
    Memento-Datetime: Sat, 14 May 2005 01:36:08 GMT
    …
* https://datatracker.ietf.org/doc/draft-vandesompel-memento/
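As a rough sketch of the client side of this exchange, the snippet below builds an `Accept-Datetime` value and measures how far the returned `Memento-Datetime` is from the requested datetime. It uses only the Python standard library and performs no network request; the helper names `accept_datetime` and `negotiation_offset` are hypothetical. The library derives the weekday from the date, so the generated header need not match the slide's text character for character.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime, parsedate_to_datetime


def accept_datetime(target: datetime) -> str:
    """Format an aware datetime as an HTTP-date for the Accept-Datetime header."""
    return format_datetime(target.astimezone(timezone.utc), usegmt=True)


def negotiation_offset(target: datetime, memento_datetime: str) -> timedelta:
    """Return how far the returned memento is from the requested datetime."""
    return parsedate_to_datetime(memento_datetime) - target


# Requested datetime and response header value taken from the slide.
target = datetime(2005, 5, 10, 11, 21, 0, tzinfo=timezone.utc)
request_headers = {"Accept-Datetime": accept_datetime(target)}
offset = negotiation_offset(target, "Sat, 14 May 2005 01:36:08 GMT")
```

Here `offset` is the gap between what was asked for and what the archive returned, which is the quantity the later drift and spread slides measure.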

10 CONTENTS
- Motivation
- Related work
- Preliminary work
  - How much of the Web is archived
  - Temporal Drift
- Temporal Spread
- Future work
- Conclusion

11 HOW MUCH IS ARCHIVED? (JCDL’11)
- 35–90%: at least one archived copy
- 17–49%: 2–5 copies
- 1–8%: 6–10 copies
- 8–63%: > 10 copies
[Chart legend: Internet Archive, Search Engine, Other]

12 CONTENTS
- Motivation
- Related work
- Preliminary work
  - How much of the Web is archived
  - Temporal Drift
- Temporal Spread
- Future work
- Conclusion

13 TEMPORAL DRIFT
Comparing two policies
- Sliding – target datetime changes
- Sticky – target datetime held steady

14 SLIDING TARGET: 2005-05-14 01:36:08

15 SLIDING TARGET: 2005-04-22 00:17:52

16 SLIDING TARGET: 2005-03-31 09:16:10

17 TEMPORAL DRIFT
What we expected: 2005-05-14 @ 01:36:08
What we got: 2005-03-31 @ 09:16:10

18 STICKY TARGET
What if the target is held steady? (Enabled by Memento API)

19 STICKY TARGET (MementoFox Extension, 2005-05-14): 2005-05-14 01:36:08

20 STICKY TARGET: 2005-04-22 00:17:52

21 STICKY TARGET: 2005-05-14 01:36:08

22 DRIFT COMPARISON
Page         | Sliding Datetime    | Sliding Drift          | Sticky Datetime     | Sticky Drift
CS Home      | 2005-05-14 01:36:08 | –                      | 2005-05-14 01:36:08 | –
Science Home | 2005-04-22 00:17:52 | 22.1 days              | 2005-04-22 00:17:52 | 22.1 days
CS Home      | 2005-03-31 09:16:10 | 43.7 days (+21.6 days) | 2005-05-14 01:36:08 | –
Mean         |                     | 32.9 days              |                     | 11.0 days
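The drift figures in this table can be recomputed from the datetimes alone. The sketch below assumes drift is the absolute difference between the original target datetime and each memento's datetime, and that the mean is taken over the two steps after the starting page; both are inferences from the table, and the function names are illustrative.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def drift_days(target: str, observed: str) -> float:
    """Absolute difference between target and observed datetimes, in days."""
    delta = datetime.strptime(observed, FMT) - datetime.strptime(target, FMT)
    return abs(delta.total_seconds()) / 86400


def mean_drift(target: str, observed_steps: list) -> float:
    """Mean drift over the steps that follow the starting page."""
    drifts = [drift_days(target, o) for o in observed_steps]
    return sum(drifts) / len(drifts)


TARGET = "2005-05-14 01:36:08"
# Memento datetimes for steps 2 and 3 under each policy, from the table.
SLIDING = ["2005-04-22 00:17:52", "2005-03-31 09:16:10"]
STICKY = ["2005-04-22 00:17:52", "2005-05-14 01:36:08"]

sliding_mean = mean_drift(TARGET, SLIDING)
sticky_mean = mean_drift(TARGET, STICKY)
```

Under the sliding policy each step inherits the previous memento's datetime as its new target, so the error accumulates; under the sticky policy the return to CS Home lands back on the original memento and its drift is zero.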

23 MEDIAN DRIFT BY STEP (JCDL’13)
[Chart: median drift (months) by step number; series: Sliding, Sticky]

24 CONTENTS
- Motivation
- Related work
- Preliminary work
  - How much of the Web is archived
  - Temporal Drift
- Temporal Spread
- Future work
- Conclusion

25 TEMPORAL SPREAD

26 COMPOSITE MEMENTO
[Figure panels: PRESENTATION, STRUCTURE]

27 TEMPORAL SPREAD
[Figure labels: 2005-05-14 01:36:08; +9 days; +18 days; +7 months; +2.1 years]

28 EMBEDDED RESOURCES
Resource | Memento-Datetime | Delta
http://www.cs.odu.edu | 2005-05-14 01:36:08 |
mm_menu.js | 2005-05-23 02:39:12 | 9.0 d
style.css | 2005-05-23 02:39:39 | 9.0 d
gfx-logo-odu-crown.gif | 2005-05-23 02:39:39 | 9.0 d
ddmenu_ddown.js | 2005-05-23 02:39:43 | 9.0 d
university.js | 2005-05-23 02:39:56 | 9.0 d
rmenu_1st_about.png | 2005-06-01 13:40:25 | 18.5 d
rmenu_bottom_229.gif | 2005-06-01 14:07:29 | 18.5 d
shadow-bl.gif | 2005-06-01 14:55:53 | 18.6 d
ecsbdg.jpg | 2005-06-01 14:56:17 | 18.6 d
shadow-br.gif | 2005-06-01 15:18:18 | 18.6 d
gfx-btn-go-dblue.gif | 2005-06-01 15:34:19 | 18.6 d
shadow-tr.gif | 2005-06-01 15:55:57 | 18.6 d
header-right1.gif | 2005-06-01 16:06:16 | 18.6 d
spacer.gif | 2005-06-01 16:23:10 | 18.6 d
jimcheng.gif | 2005-06-01 16:37:39 | 18.6 d
jsmith.gif | 2005-06-01 16:58:50 | 18.6 d
rmenu_1st_featured_alumni.png | 2005-06-01 21:21:45 | 18.8 d
hmenu_college_...-new.png | 2005-12-21 20:14:25 | 7.3 mo
rmenu_1st_upcoming_news.png | 2005-12-21 20:15:14 | 7.3 mo
rmenu_1st_upcoming_events.png | 2005-12-21 21:01:12 | 7.3 mo
lmenu_1st_resources.png | 2005-12-28 17:47:41 | 7.5 mo
bullet_blue_triangle.gif | 2005-12-28 19:43:48 | 7.5 mo
logo-cs.gif | 2005-12-28 19:54:29 | 7.5 mo
rmenu_1st_featured_student.png | 2007-06-12 02:36:07 | 2.1 years
shadow-b.gif | 2007-06-21 02:35:17 | 2.1 years
shadow-r.gif | 404 Not Found |

Embedded Resources: 26
Mean Delta: 125.9 days
Standard Deviation: 207.7 days
Spread: 2.1 years
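The summary statistics on this slide can be recomputed from the listed Memento-Datetimes. The sketch below makes three assumptions that are not stated on the slide: the 404 resource is excluded from the mean and standard deviation, the standard deviation is the population form (`statistics.pstdev`), and spread is the gap between the earliest and latest datetime across the root and its embedded resources. Under those assumptions the results round to the slide's values.

```python
from datetime import datetime
from statistics import mean, pstdev

FMT = "%Y-%m-%d %H:%M:%S"
ROOT = datetime.strptime("2005-05-14 01:36:08", FMT)

# Embedded resources from the slide; None marks the 404 Not Found entry.
EMBEDDED = {
    "mm_menu.js": "2005-05-23 02:39:12",
    "style.css": "2005-05-23 02:39:39",
    "gfx-logo-odu-crown.gif": "2005-05-23 02:39:39",
    "ddmenu_ddown.js": "2005-05-23 02:39:43",
    "university.js": "2005-05-23 02:39:56",
    "rmenu_1st_about.png": "2005-06-01 13:40:25",
    "rmenu_bottom_229.gif": "2005-06-01 14:07:29",
    "shadow-bl.gif": "2005-06-01 14:55:53",
    "ecsbdg.jpg": "2005-06-01 14:56:17",
    "shadow-br.gif": "2005-06-01 15:18:18",
    "gfx-btn-go-dblue.gif": "2005-06-01 15:34:19",
    "shadow-tr.gif": "2005-06-01 15:55:57",
    "header-right1.gif": "2005-06-01 16:06:16",
    "spacer.gif": "2005-06-01 16:23:10",
    "jimcheng.gif": "2005-06-01 16:37:39",
    "jsmith.gif": "2005-06-01 16:58:50",
    "rmenu_1st_featured_alumni.png": "2005-06-01 21:21:45",
    "hmenu_college_...-new.png": "2005-12-21 20:14:25",
    "rmenu_1st_upcoming_news.png": "2005-12-21 20:15:14",
    "rmenu_1st_upcoming_events.png": "2005-12-21 21:01:12",
    "lmenu_1st_resources.png": "2005-12-28 17:47:41",
    "bullet_blue_triangle.gif": "2005-12-28 19:43:48",
    "logo-cs.gif": "2005-12-28 19:54:29",
    "rmenu_1st_featured_student.png": "2007-06-12 02:36:07",
    "shadow-b.gif": "2007-06-21 02:35:17",
    "shadow-r.gif": None,
}

# Datetimes of the embedded resources that were actually found.
found = [datetime.strptime(v, FMT) for v in EMBEDDED.values() if v is not None]

# Delta = | root - embedded |, expressed in days.
deltas = [abs((dt - ROOT).total_seconds()) / 86400 for dt in found]
mean_delta = mean(deltas)
std_delta = pstdev(deltas)

# Spread = latest minus earliest datetime in the composite memento.
everything = found + [ROOT]
spread_days = (max(everything) - min(everything)).total_seconds() / 86400
spread_years = spread_days / 365.25
```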

29 REPRESENTING SPREAD: COMPOSITE MEMENTO TEMPORAL SPREAD CHART
[Chart legend: Root, Embedded, Diff. Domain, Reused]

30 TEMPORAL SPREAD – ODU CS

31 FIRST EXPERIMENT
- 1,000 URIs from DMOZ (Open Directory)
- Download all timemaps
- Download all composite mementos
- Download all embedded resources
- Single and multiple archives
- Four heuristics

32 PRELIMINARY RESULTS
Count | Description | Percent
1,000 | Root URI-Rs |
910 | Root timemaps | 91%
87,847 | Root URI-Ms in timemaps |
96.5 | URI-Ms per Root URI-R |
85,570 | Root mementos downloaded | 97%
1,488,420 | Embedded URI-Rs |
17.4 | Embedded URI-Rs per Root memento |
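The derived rows of this table appear to follow from the raw counts. The short check below assumes the denominators: URI-Ms per root URI-R is taken over the 910 URI-Rs that had timemaps, the 97% is downloaded mementos over URI-Ms listed in timemaps, and embedded URI-Rs per root memento is taken over downloaded mementos. The variable names are illustrative.

```python
# Raw counts from the slide.
root_uri_rs = 1_000
root_timemaps = 910
root_uri_ms = 87_847
root_mementos_downloaded = 85_570
embedded_uri_rs = 1_488_420

# Derived figures, with the assumed denominators noted per line.
timemap_pct = 100 * root_timemaps / root_uri_rs              # share of URI-Rs with a timemap
uri_ms_per_uri_r = root_uri_ms / root_timemaps                # assumed: per URI-R with a timemap
downloaded_pct = 100 * root_mementos_downloaded / root_uri_ms # share of listed URI-Ms downloaded
embedded_per_memento = embedded_uri_rs / root_mementos_downloaded
```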

33 SINGLE/MULTI & HEURISTICS
Description | Minimize Distance, Single Archive | Minimize Distance, Multi-Archive | 3-Month Window, Multi-Archive
Embedded URI-Rs | 1,488,440 | 1,488,420 | 1,447,351
Embedded URI-Ms in timemaps | 1,169,787 | 1,186,456 | 500,541
URI-M/Embedded URI-R | 0.79 | 0.80 | 0.35
% Complete | 73.8% | 75.4% | 33.8%
Mean spread | 200.2 | 200.1 | 15.1
Standard Deviation | 219.2 | 219.9 | 14.3
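One way to read the two heuristics compared here is sketched below. The slides do not define them precisely, so this is an assumption: "minimize distance" picks the candidate memento closest to the root's datetime, and the "3-month window" additionally rejects candidates more than 90 days away on either side. The function names and the 90-day value are hypothetical.

```python
from datetime import datetime, timedelta


def minimize_distance(target, candidates):
    """Pick the memento datetime closest to target; None if there are no candidates."""
    return min(candidates, key=lambda c: abs(c - target), default=None)


def within_window(target, candidates, window=timedelta(days=90)):
    """Like minimize_distance, but treat anything outside the window as missing."""
    best = minimize_distance(target, candidates)
    if best is None or abs(best - target) > window:
        return None
    return best


root = datetime(2005, 5, 14, 1, 36, 8)
near = datetime(2005, 5, 23, 2, 39, 12)
far = datetime(2007, 6, 21, 2, 35, 17)

closest = minimize_distance(root, [near, far])
rejected = within_window(root, [far])
```

A window trades completeness for coherence: resources whose nearest memento is far from the root are left missing, which is consistent with the lower % Complete and much smaller mean spread in the table's last column.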

34 TEMPORAL COHERENCE: 1 Memento, Bracketed Root

35 TEMPORAL COHERENCE: 1 Memento, Bracketed Root

36 TEMPORAL COHERENCE: 1 Memento, Bracketed Root

37 TEMPORAL COHERENCE: 1 Memento, Root Not Bracketed

38 TEMPORAL COHERENCE: 1 Memento, Root Not Bracketed

39 TEMPORAL COHERENCE: 1 Memento, No Last-Modified

40 TEMPORAL COHERENCE: 1 Memento, Before Root

41 TEMPORAL COHERENCE: 2 Mementos, Root Not Bracketed

42 TEMPORAL COHERENCE: 2 Mementos, Root Not Bracketed

43 TEMPORAL COHERENCE: 2 Mementos, Use Content – Similarity

44 TEMPORAL COHERENCE: 2 Mementos, Contents Equal or Equivalent

45 TEMPORAL COHERENCE: 2 Mementos, Contents Not Equal or Equivalent

46 CURRENT EXPERIMENT
- 4,000 URIs from JCDL’11 “How Much…” paper
- 1 URI/month vice all
- Temporal coherence patterns
- Target WSDM 2013

47 CURRENT EXPERIMENT

48 CONTENTS
- Motivation
- Related work
- Preliminary work
- Temporal Spread
- Future work
- Conclusion

49 FUTURE WORK: Timemaps, Redirection, Missing Mementos
- Timemaps only tell part of the story
  - URI-R redirection (302 from source)
  - URI-M redirection (archive action)
  - Mementos in timemaps but not accessible
- Policies must consider user needs
  - Leave it missing
  - Show “best” substitute

50 FUTURE WORK: Similarity & Duplication
- Deltas are currently | root – embedded |
- If bracketing mementos are identical, should delta be zero?
- HTML is usually modified by the archive
  - Can’t check for equality
  - Shingling? SimHash?
[Figure axis: –30d, 0, +30d]
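The question on this slide can be made concrete with a small sketch. It implements one possible policy, not a result from the talk: if the mementos captured just before and just after the root have identical content, the resource is assumed unchanged in between and the delta is treated as zero; otherwise the delta is the distance to the nearer memento. Exact SHA-256 equality is used for simplicity, which suits binary resources but not archive-rewritten HTML, where a similarity measure would be needed instead. All names are hypothetical.

```python
import hashlib
from datetime import datetime, timedelta


def digest(content: bytes) -> str:
    """SHA-256 digest used as a simple exact-equality check."""
    return hashlib.sha256(content).hexdigest()


def embedded_delta(root_dt, before=None, after=None):
    """Delta for one embedded resource.

    before/after are (datetime, content) pairs for the mementos captured
    at or before, and at or after, the root datetime; either may be None.
    """
    if before and after and digest(before[1]) == digest(after[1]):
        # Identical bracketing mementos: assume no change across the root.
        return timedelta(0)
    distances = [abs(root_dt - m[0]) for m in (before, after) if m]
    return min(distances, default=None)


root = datetime(2005, 5, 14)
month = timedelta(days=30)
same = embedded_delta(root, (root - month, b"logo"), (root + month, b"logo"))
changed = embedded_delta(root, (root - month, b"old"), (root + month, b"new"))
```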

51 FUTURE WORK: Communicating Status

52 FUTURE WORK: Policies & Heuristics
- Current spread heuristics
  - Minimize distance
  - Past only
  - Past preferred
  - Near or within distance
  - Single vs. multi-archive
- Refine to meet user expectations
  - Speed (minimize time)
  - Accuracy (minimize temporal error)
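The "past only" and "past preferred" heuristics are named but not defined on the slide. The sketch below shows one plausible reading, stated as an assumption: "past only" accepts the latest memento captured at or before the target and gives up otherwise, while "past preferred" falls back to the earliest later memento when nothing from the past exists. The function names are illustrative.

```python
from datetime import datetime


def past_only(target, candidates):
    """Latest memento at or before target; None if every candidate is later."""
    past = [c for c in candidates if c <= target]
    return max(past, default=None)


def past_preferred(target, candidates):
    """Prefer a past memento; otherwise fall back to the earliest future one."""
    chosen = past_only(target, candidates)
    if chosen is not None:
        return chosen
    return min((c for c in candidates if c > target), default=None)


target = datetime(2005, 5, 14, 1, 36, 8)
earlier = datetime(2005, 4, 22, 0, 17, 52)
later = datetime(2005, 5, 23, 2, 39, 12)

strict = past_only(target, [later])
relaxed = past_preferred(target, [later])
```

Restricting selection to the past avoids showing content that did not yet exist at the target datetime, at the cost of more missing resources; which trade-off is right depends on the user goals listed on the slide.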

53 CONTENTS
- Motivation
- Related work
- Preliminary work
- Future work
- Conclusion

54 CONCLUSION
- Extensive research on improving acquisition exists
- Best use of existing collections needs study
- We are looking at
  - Characterizing existing holdings
  - Characterizing temporal coherence
  - Policies that minimize impact of temporal incoherence
  - Visualizations of temporal coherence

55 MY QUESTIONS: Coherent

56 MY QUESTIONS: Violation

57 MY QUESTIONS: What do these mean to users?
[Figure labels: (3) (2) (1) (4)]

58 MY QUESTIONS: What does this mean to users?

