Building a new archiving service for everyone!

Presentation on theme: "Building a new archiving service for everyone!"— Presentation transcript:

1 WebRecorder.io: Building a new archiving service for everyone!

5 What is WebRecorder.io?
On-Demand Archiving through the browser.
What you see is what you archive (WYSIWYA).
Available to anyone!
“Quality over Quantity” - High-Fidelity Replay of Web Content

12 Current Service Proof-of-Concept
Users can record a page and browse.
Users can download the WARC after browsing.
Users can upload any WARC and replay it.
No content is stored; WARCs are deleted after 30 minutes.
Created last year as an experiment.
You can use it at:

17 New WebRecorder.io Service
First new version up at beta.webrecorder.io for demo.
Initially invite-only, to monitor capacity.
User registration, login, individual collections.
Collections available at beta.webrecorder.io/<user>/<coll>.
Collections can be private, public, or shared privately (coming soon).

18 Live Demo!

24 Privacy Concerns
The user is responsible for their own archive and has full control.
Collections are private by default, but users may choose what to make public.
For now, WARCs are downloadable only by the owner, though this may change.
May have additional access levels: share read-only, share for recording, etc.
Cookies: cookies are recorded, but not replayed.
Looking for ideas/better ways to address privacy. Suggestions welcome!

30 Goals/Features
Provide a flexible archiving service for high-fidelity web archiving.
Customizable UI, metadata and annotation support.
On-Demand Full-Text Search.
Multiple privacy options, custom sharing settings.
Multiple backends for storage.
A version that can also be hosted on custom hardware, not in “the cloud”.

31 Tools Used in WebRecorder.io
Built with open-source tools:
pywb (python wayback): embedded in the web app as the front-end web service; handles URL rewriting with custom rules, WARC reading, and live web fetching.
warcprox: created by Noah Levitt of IA; an HTTP/S proxy which records HTTP traffic to WARCs.

34 Help Wanted!
Looking for collaborators, developers, UI designers, archivists.
If you ever wanted to participate in building an archiving service, here is your chance.
Sign up for the mailing list on webrecorder.io or request an invite at beta.webrecorder.io
Also can

35 Addendum: How It Works
Symmetrical Archiving: server-side and client-side URL rewriting for record and replay follow the same path.
Easy part: HTML URL rewriting.
Hard part: JavaScript. Attempt to emulate the original JS environment as much as possible, with customizable client-side hooks.
Far from foolproof: Flash and Java applets are still problematic.
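
To make the "easy part" concrete, here is a minimal Python sketch of HTML URL rewriting. It is not pywb's actual rewriter: the function name `rewrite_html`, the attribute list, and the `/replay/` prefix are assumptions for illustration, and a regular expression stands in for a real HTML parser.

```python
import re
from urllib.parse import urljoin

# Assumption: only href/src/action attributes with quoted values are handled.
# A production rewriter would use a real HTML parser and cover many more cases.
_ATTR_RE = re.compile(
    r'''(?<![\w-])(?P<attr>href|src|action)=(?P<q>["'])(?P<url>.*?)(?P=q)''',
    re.IGNORECASE,
)


def rewrite_html(html: str, page_url: str, prefix: str) -> str:
    """Point link-like attributes at the archive instead of the live web.

    page_url is the original URL of the page (used to resolve relative links);
    prefix is the archive path to prepend, e.g. "/replay/" (hypothetical).
    """
    def _repl(match: re.Match) -> str:
        # Resolve relative URLs against the original page, then add the prefix.
        absolute = urljoin(page_url, match.group("url"))
        quote = match.group("q")
        return f'{match.group("attr")}={quote}{prefix}{absolute}{quote}'

    return _ATTR_RE.sub(_repl, html)
```

For example, `rewrite_html('<a href="/about">x</a>', 'http://example.com/index.html', '/replay/')` resolves the link against the page and returns it with the archive prefix in front. The sketch deliberately ignores `mailto:` links, fragments, inline styles, and anything generated by JavaScript, which is where the hard part begins.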

37 “Symmetrical Archiving”
User browses page through /record/ path → Page is recorded to WARC and indexed.
User browses page through /replay/ path → Page is replayed from WARC using index.
Attempt symmetry in capture and replay as much as possible.
Assumption: Dynamic content generated for /record/ = Dynamic content generated for /replay/

38 “Symmetrical Archiving”
/<coll>/record/ path ↔ URL rewriting system ↔ fetch HTTP data ↔ recording proxy writes WARCs ↔ live web
/<coll>/ path ↔ URL rewriting system ↔ fetch HTTP data ↔ read from WARC
Attempt symmetry in capture and replay as much as possible.
Recorded content is instantly replayable.
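
The two pipelines above can be sketched in Python as a single handler in which record and replay differ only in where the raw bytes come from. This is an illustrative model under stated assumptions, not the service's code: `SymmetricArchive`, `prefix_links`, and the in-memory dict standing in for the WARC file and its index are all hypothetical.

```python
from typing import Callable, Dict


class SymmetricArchive:
    """Record and replay share one rewriting path; only the data source differs."""

    def __init__(self,
                 fetch_live: Callable[[str], bytes],
                 rewrite: Callable[[bytes, str], bytes]) -> None:
        self._fetch_live = fetch_live          # stand-in for the recording proxy + live web
        self._rewrite = rewrite                # the shared URL rewriting system
        self._store: Dict[str, bytes] = {}     # stand-in for WARC file + index

    def handle(self, mode: str, url: str) -> bytes:
        if mode == "record":
            raw = self._fetch_live(url)
            self._store[url] = raw             # archive the raw, unrewritten bytes
        elif mode == "replay":
            raw = self._store[url]             # KeyError if the URL was never recorded
        else:
            raise ValueError(f"unknown mode: {mode!r}")
        # Same rewriting step for both modes: this is the symmetry.
        return self._rewrite(raw, f"/{mode}/")


def prefix_links(raw: bytes, prefix: str) -> bytes:
    """Toy rewriter (assumption): prepend the mode prefix to double-quoted hrefs."""
    return raw.replace(b'href="', b'href="' + prefix.encode("ascii"))
```

Because the raw bytes are stored as soon as they are fetched, a URL handled once in `record` mode can be handled in `replay` mode immediately afterwards, which mirrors the "instantly replayable" point on the slide.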

39 “Symmetrical Archiving”
/<coll>/record/ path ↔ URL rewriting system ↔ fetch HTTP data ↔ recording proxy writes WARCs ↔ live web
/<coll>/ path ↔ URL rewriting system ↔ fetch HTTP data ↔ read from WARC
URL rewriting is the hard part!
Actually more like “emulating original page context” when running through a proxy/recording.

46 “When symmetry breaks”
URLs change based on timestamp or date, e.g. ?_=<timestamp>
Possible Solution: override Date(); server-side “fuzzy matching” that ignores certain query params.
Flash video in a custom Flash SWF.
Possible Solution: may be able to force HTML5; otherwise youtube-dl may download the Flash version, which is then replaced with a custom player (FlowPlayer).
General black-box Flash content with hard-coded links.
Possible Solution: no good one so far! Maybe shumway.js, a JavaScript Flash player from Mozilla?
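
As an illustration of server-side fuzzy matching, the Python sketch below builds a lookup key that drops query parameters assumed to be insignificant. The names `fuzzy_key` and `IGNORED_PARAMS`, and the choice of `_` as the only ignored parameter, are assumptions taken from the `?_=<timestamp>` example, not the service's actual rules.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: "_" is a cache-busting timestamp param, as in ?_=<timestamp>.
IGNORED_PARAMS = frozenset({"_"})


def fuzzy_key(url: str, ignored: frozenset = IGNORED_PARAMS) -> str:
    """Return a canonical form of url with ignored query params removed.

    Remaining params are sorted so that parameter order does not matter.
    """
    parts = urlsplit(url)
    kept = sorted(
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in ignored
    )
    # Drop the fragment; it is never sent to the server.
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, urlencode(kept), "")
    )
```

If the index is keyed on `fuzzy_key(url)` at record time, a replay request for `http://example.com/api?id=7&_=1500000000` finds the capture made for `http://example.com/api?id=7&_=1400000000`, since both reduce to `http://example.com/api?id=7`.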

47 “When symmetry breaks”
JavaScript-generated content “leaks” to the live web.
Possible Solution: extensive client-side URL rewriting.
Checks for window.location or window.top.
Possible Solution: rewrite window.location → WB_wombat_location, window.top → WB_wombat_top
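
The server-side half of that rename can be sketched as a text substitution over JavaScript source. This Python example is a simplification under stated assumptions: the function name `rewrite_js` is hypothetical, and it keeps the `window.` receiver and renames only the property, which is one reading of the mapping on the slide. The client-side library is assumed to define the `WB_wombat_*` properties.

```python
import re

# Match window.location / window.top as whole property names only,
# so identifiers such as mywindow.location or window.topology are left alone.
_WINDOW_PROP = re.compile(r"\bwindow\.(location|top)\b")


def rewrite_js(source: str) -> str:
    """Rename window.location and window.top to their WB_wombat_ counterparts."""
    return _WINDOW_PROP.sub(r"window.WB_wombat_\1", source)
```

A frame-busting check such as `if (window.top != window) { ... }` then compares against `window.WB_wombat_top`, which the client-side code can define so that the page does not see the archive's container frame. A regex like this misses bare `location`, aliases such as `self.top`, and names assembled at runtime, which is why it is only one part of the approach.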

48 wombat.js rewriting library
The following are some of the possible overrides by wombat.js:
AJAX (XMLHttpRequest.open)
window.open
History.pushState / replaceState
Object.defineProperty() overrides on: document.domain, document.cookie
WB_wombat_location emulates window.location with rewriting (with server-side rewriting)
WB_wombat_top emulates window.top but hides the container frame (with server-side rewriting)
Window postMessage()
Date() constructor
Seed Math.random with capture time
document.write()
setAttribute() / or mutation observers
appendChild() / replaceChild() / insertChild()

49 pywb
wombat.js is part of pywb, a new open-source Python “wayback machine” implementation.
Optional custom rules can be specified for any site by prefix or regex, in a YAML file.
Fuzzy matching rules: specify significant query params.
No config file required!
Out-of-the-box simple collection management tools for running an archive.
More details at:
Future updates will include improvements to rule customization.

