Web Archiving @ The Internet Archive

Agenda
Brief Introduction to IA Web Archiving
Collection Policies and Strategies
Key Challenges (opportunities for broader collaboration…)

What is the Internet Archive?
A digital library established in 1996 that contains over four and a half petabytes (compressed) of publicly accessible digital archival material
A 501(c)(3) nonprofit organization
A technology partner to libraries, archives, museums, universities, research institutes, and memory institutions
Currently archiving books, texts, film, video, audio, images, software, educational content, television, and the Internet…
www.archive.org

Data Storage & Preservation

IA’s Web archive spans 1996-present & includes over 150 billion web instances
Develop freely available, open source, web archiving & access tools (Heritrix, Wayback, NutchWAX…)
Provide services that enable partners to drive their web archiving programs
Perform crawls & host collections for libraries, archives, universities, museums, & other memory institutions
www.archive.org/web/
www.archiveit.org

Today’s Landscape
“The current size of the world’s digital content is equivalent to all the information that could be stored on 75bn Apple iPads, or the amount [of data] that would be generated by everyone in the world posting messages on Twitter constantly for a century...”
Source: UK Telegraph, http://www.telegraph.co.uk/technology/news/7675214/Zettabytes-overtake-petabytes-as-largest-unit-of-digital-measurement.html (IDC annual survey, released May 2010)

Today’s Web Landscape
Google: “seen well over 1 trillion unique URLs”
Actual indexed pages:
  – Tens of billions+ (~40-50 billion?)
  – Cuil: “127 bil web pages” (July 15, 2010)
Hundreds of millions of “sites”
  – Site: publishing network endpoint; one page to millions per site
  – Diversity of content: streamed, social, interactive…

Collection Policies & Strategies
Crawl Strategies
1) Broad, web-wide surveys from every domain, in every language, including media and text, static and interactive interfaces
2) Organic link discovery at all levels of a host/site
3) End-of-life, exhaustive harvests
4) Selective/thematic & resource-specific harvests
Key Inputs: registry data, trusted directories, Wikipedia, subject matter experts, prior crawl data
Frequency: usually ongoing, but at least yearly…
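Organic link discovery (strategy 2) can be sketched as a breadth-first walk that follows links outward from seed URLs while staying on the seeds' own hosts. This is only an illustration under stated assumptions: the function `discover`, the `max_hops` limit, and the in-memory `LINKS` graph are hypothetical stand-ins, and a production crawler fetches live pages and applies far richer scope rules.

```python
from collections import deque
from urllib.parse import urlsplit


def discover(seeds, links, max_hops=2):
    """Breadth-first link discovery limited to the hosts of the seed URLs.

    `links` is a hypothetical mapping of URL -> list of outgoing link URLs,
    standing in for what a crawler would extract from fetched pages.
    """
    in_scope_hosts = {urlsplit(seed).hostname for seed in seeds}
    seen = set(seeds)
    queue = deque((seed, 0) for seed in seeds)
    order = []
    while queue:
        url, hops = queue.popleft()
        order.append(url)
        if hops == max_hops:
            continue  # do not expand links beyond the hop limit
        for target in links.get(url, []):
            if target in seen:
                continue
            if urlsplit(target).hostname not in in_scope_hosts:
                continue  # off-host link: out of scope for this harvest
            seen.add(target)
            queue.append((target, hops + 1))
    return order


# Hypothetical link graph used only for illustration.
LINKS = {
    "http://example.org/": ["http://example.org/a", "http://other.example.com/"],
    "http://example.org/a": ["http://example.org/b", "http://example.org/"],
    "http://example.org/b": ["http://example.org/c"],
}

if __name__ == "__main__":
    for url in discover(["http://example.org/"], LINKS, max_hops=2):
        print(url)
```

The hop limit and host check are the simplest possible scope rules; a broad survey would relax the host check, while a selective harvest would tighten it.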

Typical Challenges of Archiving the Web
Harvests are at best samples
  – Time & expense: can’t get everything
  – Rate of change: don’t get every version
  – Rate of collection: issues of ‘time skew’
User agents/Protocols
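To make ‘time skew’ concrete: a page and the images, scripts, and stylesheets it embeds are rarely captured at the same instant, so a replayed page can mix moments in time. The sketch below measures the largest gap between a page capture and its embedded resources. It assumes capture times are written as 14-digit YYYYMMDDhhmmss strings; the function names and sample values are hypothetical.

```python
from datetime import datetime


def parse_ts(ts):
    """Parse a 14-digit capture timestamp (assumed format: YYYYMMDDhhmmss)."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S")


def max_skew_seconds(page_ts, resource_ts):
    """Largest absolute gap, in seconds, between a page capture and the
    captures of the resources it embeds. Returns 0.0 if there are none."""
    page = parse_ts(page_ts)
    return max(
        (abs((parse_ts(ts) - page).total_seconds()) for ts in resource_ts),
        default=0.0,
    )


if __name__ == "__main__":
    # Hypothetical capture times: one resource was captured a full day later.
    print(max_skew_seconds("20100715120000", ["20100715120005", "20100716120000"]))
```

A large value flags pages whose replay may not reflect any single state the live site was ever in.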

Typical Challenges, cont.
Publisher right to opt “in” or “out”
  – Content behind log-ins cannot be archived without credentials
  – Content can be blocked by robots.txt files (which our crawlers respect by default)
Structure of the sites/URLs makes it very hard to capture only the content of interest.
Each site has its own unique set of challenges.
  – Some parts of sites are not “archive-friendly” (e.g. complex JavaScript, Flash, etc.)
  – These sites tend to change both their technical structure and policy quickly and often.
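As a rough illustration of a robots.txt check before fetching, the sketch below uses Python's standard `urllib.robotparser`. It is not how IA's crawlers implement the rule; the robots.txt content, the user-agent string `example-archiver`, and the helper `allowed` are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content a publisher might serve.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(robots_txt, user_agent, url):
    """Return True if the robots.txt rules permit `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("http://example.org/public/page.html",
                "http://example.org/private/report.html"):
        print(url, allowed(ROBOTS_TXT, "example-archiver", url))
```

A crawler that respects robots.txt by default would skip any URL for which this check fails, which is one way publishers opt “out”.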

Challenges, cont.
Social networks and collaborative/semi-private spaces
Immersive Worlds
~70% of the world’s digital content is now generated by individuals
Source: UK Telegraph, IDC annual survey, released May 2010

Web QA & Analysis
Daunting scale, requires multi-layered approach
  – Automated QA to identify missing files used to render pages and prioritize URIs for harvest
  – Filtering of spam and content farms discovered during harvest and post harvest
  – Randomized, representative, human critique of “in” vs “out” of scope per given legal mandate
  – Advanced analyses: Web and link graphing, text mining
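The first QA layer, finding files a page needs in order to render that were never captured, can be sketched with Python's standard `html.parser`. This is a simplified illustration, not IA's QA tooling: it only looks at `img`, `script`, and stylesheet `link` tags, and the class `EmbedCollector`, the function `missing_embeds`, and the sample URLs are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class EmbedCollector(HTMLParser):
    """Collect absolute URLs of resources a page embeds in order to render."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.embeds = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        ref = None
        if tag in ("img", "script") and attrs.get("src"):
            ref = attrs["src"]
        elif (tag == "link" and attrs.get("href")
              and "stylesheet" in (attrs.get("rel") or "").lower()):
            ref = attrs["href"]
        if ref:
            # Resolve relative references against the page URL.
            self.embeds.append(urljoin(self.base_url, ref))


def missing_embeds(base_url, html, captured):
    """Return embedded resource URLs absent from the `captured` URL set."""
    collector = EmbedCollector(base_url)
    collector.feed(html)
    return [url for url in collector.embeds if url not in captured]


if __name__ == "__main__":
    page = ('<html><head><link rel="stylesheet" href="/style.css">'
            '<script src="app.js"></script></head>'
            '<body><img src="http://cdn.example.net/logo.png">'
            '<a href="/about">About</a></body></html>')
    captured = {"http://example.org/style.css"}
    for url in missing_embeds("http://example.org/page/index.html", page, captured):
        print("missing:", url)
```

The returned list is a natural input for prioritizing URIs in a follow-up harvest. Ordinary hyperlinks such as the `<a>` tag are deliberately ignored, since they are not needed to render the page.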

Key Challenges
Not all data can be crawled; need diverse methods of data collection
Data may be lost no matter how carefully it is managed
  – Need to keep multiple, distributed copies!
Harvested data can be hard to make accessible in a compelling way, on an ongoing basis, at *every* scale
Research and experimentation are essential to keep pace with publisher innovation; partnerships are the only way to “keep up” & to support demands of ongoing operations
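Keeping multiple, distributed copies only helps if damaged copies can be detected. One common approach, shown here as a minimal sketch rather than a description of IA's storage practice, is to compare checksums across copies and flag any copy that disagrees with the majority. The function names and the location labels (`site-a` and so on) are hypothetical.

```python
import hashlib
from collections import Counter


def digest(data: bytes) -> str:
    """SHA-256 hex digest of a stored object's bytes."""
    return hashlib.sha256(data).hexdigest()


def damaged_copies(copies):
    """Given a mapping of location -> bytes, return the sorted locations
    whose digest differs from the most common digest among all copies."""
    if not copies:
        return []
    digests = {location: digest(data) for location, data in copies.items()}
    majority, _count = Counter(digests.values()).most_common(1)[0]
    return sorted(loc for loc, value in digests.items() if value != majority)


if __name__ == "__main__":
    # Hypothetical copies of one archived record; one has a flipped byte.
    copies = {
        "site-a": b"archived record",
        "site-b": b"archived record",
        "site-c": b"archived recorc",
    }
    print(damaged_copies(copies))
```

With only two copies a mismatch shows that something is wrong but not which copy is correct, which is one argument for keeping three or more.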

Key Challenges
Manageable Costs/Sustainable Approaches
  – Access to power & other critical operational resources
  – Sufficient processing capacity for collection, analysis, discovery, & dissemination of resources
  – Support for on-demand assembly of collections from aggregate data sets
  – Timeliness of collection & access
Intuitive interfaces for discovering & navigating resources over time, including robust APIs
Recruitment of engineering talent
Funding

Thank You! Kris Carpenter Negulescu Director, Web Group Internet Archive kcarpenter [at] archive [dot] org
