Presentation transcript:

Web The Internet Archive

Agenda Brief Introduction to IA Web Archiving Collection Policies and Strategies Key Challenges (opportunities for broader collaboration…)

What is the Internet Archive?
A digital library established in 1996 that contains over four and a half petabytes (compressed) of publicly accessible digital archival material
A 501(c)(3) nonprofit organization
A technology partner to libraries, archives, museums, universities, research institutes, and memory institutions
Currently archiving books, texts, film, video, audio, images, software, educational content, television, and the Internet…

Data Storage & Preservation

IA’s Web archive spans 1996-present & includes over 150 billion web instances
Develop freely available, open source web archiving & access tools (Heritrix, Wayback, NutchWAX…)
Provide services that enable partners to drive their web archiving programs
Perform crawls & host collections for libraries, archives, universities, museums, & other memory institutions

Today’s Landscape
“The current size of the world’s digital content is equivalent to all the information that could be stored on 75bn Apple iPads, or the amount [of data] that would be generated by everyone in the world posting messages on Twitter constantly for a century...”
Source: UK Telegraph, IDC annual survey, released May 2010

Today’s Web Landscape
Google: “seen well over 1 trillion unique URLs”
Actual indexed pages:
  – tens of billions+ (~40-50 billion?)
  – Cuil: “127 bil web pages” (July 15, 2010)
Hundreds of millions of “sites”
  – Site: publishing network endpoint; one page to millions per site
  – Diversity of content: streamed, social, interactive…

Collection Policies & Strategies
Crawl Strategies
1) Broad, web-wide surveys from every domain, in every language, including media and text, static and interactive interfaces
2) Organic link discovery at all levels of a host/site
3) End of life, exhaustive harvests
4) Selective/thematic & resource-specific harvests
Key inputs: registry data, trusted directories, Wikipedia, subject matter experts, prior crawl data
Frequency: usually ongoing, but at least yearly…
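Strategy 2, organic link discovery within a host/site, can be sketched as a breadth-first walk of a crawl frontier. This is a minimal illustration, not how Heritrix is implemented: the function name `discover`, the in-memory `links` graph standing in for fetched pages, and the rule of staying on the seed hosts are all assumptions made for the example.

```python
from collections import deque
from urllib.parse import urlparse


def discover(seeds, links, max_urls=1000):
    """Breadth-first link discovery, restricted to the hosts of the seeds.

    `links` is a hypothetical in-memory graph (URL -> list of outlinks)
    used here in place of real page fetches.
    """
    seed_hosts = {urlparse(s).netloc for s in seeds}
    queue = deque(seeds)   # frontier of URLs waiting to be visited
    seen = set(seeds)      # URLs already queued, to avoid revisits
    order = []             # URLs in the order they were visited
    while queue and len(order) < max_urls:
        url = queue.popleft()
        order.append(url)
        for target in links.get(url, []):
            # Follow only unseen links that stay on a seed host.
            if target not in seen and urlparse(target).netloc in seed_hosts:
                seen.add(target)
                queue.append(target)
    return order


example_links = {
    "http://a.example/": ["http://a.example/x", "http://b.example/"],
    "http://a.example/x": ["http://a.example/", "http://a.example/x/y"],
}
visited = discover(["http://a.example/"], example_links)
```

A broad, web-wide survey (strategy 1) would relax the host restriction, while a selective harvest (strategy 4) would tighten it to specific paths or resource types.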

Typical Challenges of Archiving the Web
Harvests are at best samples
  – Time & expense: can’t get everything
  – Rate of change: don’t get every version
  – Rate of collection: issues of ‘time skew’
User agents/protocols
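The ‘time skew’ issue can be made concrete: a replayed page combines an HTML capture with embedded resources that may have been collected at different moments. The sketch below measures the largest gap. It assumes 14-digit capture timestamps in YYYYMMDDhhmmss form; the function name and the sample values are hypothetical.

```python
from datetime import datetime

TS_FORMAT = "%Y%m%d%H%M%S"  # 14-digit capture timestamp, e.g. 20100715120000


def time_skew_seconds(page_ts, resource_ts_list):
    """Largest absolute gap, in seconds, between the page capture
    and the capture of any of its embedded resources."""
    page_time = datetime.strptime(page_ts, TS_FORMAT)
    gaps = [
        abs((datetime.strptime(ts, TS_FORMAT) - page_time).total_seconds())
        for ts in resource_ts_list
    ]
    return max(gaps, default=0.0)


# Page captured at noon; one image captured 5 seconds later,
# one stylesheet captured a full day later.
skew = time_skew_seconds("20100715120000", ["20100715120005", "20100716120000"])
```

A large skew suggests the replayed page may mix content that never appeared together on the live web.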

Typical Challenges, cont.
Publisher right to opt “in” or “out”
  – Content behind log-ins cannot be archived without credentials
  – Content can be blocked by robots.txt files (which our crawlers respect by default)
The structure of sites/URLs makes it very hard to capture only the content of interest. Each site has its own unique set of challenges.
  – Some parts of sites are not “archive-friendly” (e.g., complex JavaScript, Flash)
  – These sites tend to change both their technical structure and policy quickly and often.
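How a crawler might honor a robots.txt opt-out can be sketched with Python's standard `urllib.robotparser` module. This is only an illustration of the check, not the Internet Archive's crawler code; the robots.txt content, the user agent name `example-crawler`, and the URLs are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt published by a site owner.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


blocked = is_allowed(ROBOTS_TXT, "example-crawler", "http://example.org/private/page.html")
open_page = is_allowed(ROBOTS_TXT, "example-crawler", "http://example.org/public/page.html")
```

A crawler that respects robots.txt by default would run a check like this before queuing each URL and skip any URL for which it returns False.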

Challenges, cont.
Social networks and collaborative/semi-private spaces
Immersive worlds
~70% of the world’s digital content is now generated by individuals
Source: UK Telegraph, IDC annual survey, released May 2010

Web QA & Analysis
Daunting scale, requires multi-layered approach
  – Automated QA to identify missing files used to render pages and prioritize URIs for harvest
  – Filtering of spam and content farms discovered during harvest and post-harvest
  – Randomized, representative, human critique of “in” vs. “out” of scope per given legal mandate
  – Advanced analyses: web and link graphing, text mining
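The automated QA step, identifying missing files used to render pages, could look roughly like the sketch below. It extracts embedded resources from a captured page and reports those absent from the set of captured URLs. It is an assumption-laden simplification: the class and function names are hypothetical, only `img`, `script`, and `link` tags are inspected, and every `link href` is treated as a rendering dependency.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class EmbedExtractor(HTMLParser):
    """Collects absolute URLs of resources a page embeds (simplified)."""

    # Tag -> attribute holding the embedded resource (illustrative subset).
    EMBED_ATTRS = {"img": "src", "script": "src", "link": "href"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.embeds = []

    def handle_starttag(self, tag, attrs):
        wanted = self.EMBED_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                # Resolve relative references against the page URL.
                self.embeds.append(urljoin(self.base_url, value))


def missing_embeds(page_url, html, captured_urls):
    """Return embedded resource URLs that are not in the captured set."""
    extractor = EmbedExtractor(page_url)
    extractor.feed(html)
    return [url for url in extractor.embeds if url not in captured_urls]


page = (
    '<html><head><link rel="stylesheet" href="/style.css">'
    '<script src="app.js"></script></head>'
    '<body><img src="http://example.org/logo.gif"></body></html>'
)
captured = {"http://example.org/style.css", "http://example.org/logo.gif"}
to_harvest = missing_embeds("http://example.org/dir/index.html", page, captured)
```

The returned list is a natural input for prioritizing URIs in a follow-up harvest.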

Key Challenges
Not all data can be crawled; diverse methods of data collection are needed
Data may be lost no matter how carefully it is managed
  – Need to keep multiple, distributed copies!
Harvested data can be hard to make accessible in a compelling way, on an ongoing basis, at *every* scale
Research and experimentation are essential to keep pace with publisher innovation; partnerships are the only way to “keep up” & to support demands of ongoing operations
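Keeping multiple, distributed copies is most useful when the copies are compared regularly. The sketch below shows one way this might work: hash each copy and flag any location whose digest disagrees with the majority. The location names, the majority-vote rule, and the function names are assumptions for illustration, not a description of the Internet Archive's storage system.

```python
import hashlib
from collections import Counter


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def find_damaged_copies(copies):
    """Given {location: bytes}, return the sorted locations whose
    content digest differs from the most common digest."""
    if not copies:
        return []
    digests = {loc: sha256_hex(data) for loc, data in copies.items()}
    majority_digest, _ = Counter(digests.values()).most_common(1)[0]
    return sorted(loc for loc, d in digests.items() if d != majority_digest)


# Three hypothetical storage sites; one copy has been corrupted.
replicas = {"site-a": b"record", "site-b": b"record", "site-c": b"recXrd"}
damaged = find_damaged_copies(replicas)
```

With three or more copies, a damaged replica can be repaired from the agreeing majority; with only two, a mismatch can be detected but not resolved without a stored reference digest.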

Key Challenges
Manageable costs/sustainable approaches
  – Access to power & other critical operational resources
  – Sufficient processing capacity for collection, analysis, discovery, & dissemination of resources
  – Support for on-demand assembly of collections from aggregate data sets
  – Timeliness of collection & access
Intuitive interfaces for discovering & navigating resources over time, including robust APIs
Recruitment of engineering talent
Funding

Thank You!
Kris Carpenter Negulescu
Director, Web Group
Internet Archive
kcarpenter [at] archive [dot] org