Steps Towards Better Digital Archives
Hiroyuki Kawano
Department of Information and Telecommunication, Faculty of Mathematical Sciences and Information Engineering, Nanzan University, Japan
Adjunct researcher in the Digital Library Section at the National Diet Library (Japan)
Outline: Today's Talk
- Fragile Intelligence on Internet
  - Disappearing scientific/artistic/cultural contents
  - Statistics of web contents in Japan (by NDL)
- Problems of building Digital Archive
  - Technology / Legislation / Organization
- Towards Better Digital Archive
  - Technical problems: distributed crawling programs; huge storage systems using hierarchical architecture
  - Social problems: intellectual properties (copyright law, Creative Commons)
- Self-introduction
Background of My Research: Mondou
Text/Web Mining: Mondou (1996)
- Search results by Mondou, with related keywords provided by association rule mining
- Document clustering
- Information visualization
- Discovery of web communities
- Distributed and cooperative web robots
Differences between Search Engine and Web Archive
Crawling
- Web search engine: freshness judged by time stamps; informative file types (html, text, pdf, doc and others)
- Web archive: accurate crawling of entire web pages stored in target web sites, as rapidly as possible
Quality
- Web search engine: focuses on special attributes and descriptions (title, meta, hyperlink tags)
- Web archive: quality control is strongly required (original/master copies, management of archiving shots)
Search
- Web search engine: recall and precision; results are sometimes influenced by commercial interests; simple and easy query input
- Web archive: document searches are difficult (historical change and heterogeneous keywords, evolution of hyperlink structures)
Preservation
- Web search engine: short term, several months; most users prefer popular and fresh web pages
- Web archive: long term, several centuries, as with paper, microfilm etc.; migration, transformation
Adjunct researcher (2002-) in Digital Library Section at National Diet Library (Japan)
Roles of NDL KANSAI-kan
Collaborative Service between East and West
- PhD theses (45%), journals and magazines (29%), reports of Grant-in-Aid for Scientific Research of MEXT (15%), scientific reports (7%), Asian Library (4%)
Digital Portal in Japan
- Digital Library (Meiji era, 1868-1911) (2007/7): 97,000 titles, about 143,000 books
- WARP (Web Archiving Project): 1,499 titles (e-books, journals, articles, white papers etc.); 46 government organizations, 1,907 cooperative organizations. Governmental contents are also edited, modified and deleted.
- Dnavi: 9,900 directories; adds 1,100 URLs/yr; investigation: 2,300 URLs deleted among 5,600 URLs
WARP (Web Archiving Project)
- The House of Councilors
- Consolidation of cities, organizations, universities etc.
[Figure: data volumes of books and web contents, with scale marks at 1PB, 10PB and 100PB]
- Alexander Library (300 B.C.): 0.5 million, 32TB
- Science: 9,000 (1900), 90,000 (1950), 0.9 million (2000)
- Books ("Scan This Book!", http://www.nytimes.com/): public 4.8 million, 308TB; current 3.20 million, 205TB; unknown 24 million, 1540TB
- Books, reports, others: 782 million, 50PB (= 50,000TB)
- Web pages (2001): Surface Web 14TB (1 billion pages); Deep Web 7.5PB (550 billion pages)
- Web pages (2003): Surface Web 167TB; Deep Web 67-92PB
- Web pages (2005): Web (Japan) 0.45 billion pages, 18.4TB; Web (go.jp) 20 million pages, 1.6TB
- % of Archive: 30-50%
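The book volumes in the chart appear to follow from a fixed size per book. The sketch below checks that reading; the 64MB per book is an inferred assumption (it is not stated on the slide), and the labels are taken from the chart as they appear.

```python
# Assumption: every book is counted as 64MB. This value is inferred from the
# chart's numbers and is not stated in the talk.
MB_PER_BOOK = 64

# (number of items, volume in TB as quoted on the chart)
quoted = {
    "Alexander Library": (0.5e6, 32),
    "Books Public": (4.8e6, 308),
    "Books Current": (3.2e6, 205),
    "Book Unknown": (24e6, 1540),
    "Book, reports, others": (782e6, 50_000),
}


def estimated_tb(items, mb_per_item=MB_PER_BOOK):
    """Volume in decimal TB for a number of items of a fixed size."""
    return items * mb_per_item / 1e6


def relative_error(items, quoted_tb):
    """Relative gap between the estimate and the quoted volume."""
    return abs(estimated_tb(items) - quoted_tb) / quoted_tb


for label, (items, quoted_tb) in quoted.items():
    print(f"{label}: estimated {estimated_tb(items):,.1f}TB, quoted {quoted_tb:,}TB")
```

Under this assumption every quoted volume is reproduced to within about one percent, which suggests the chart was drawn from a single per-book estimate.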
Statistics of Web Sites
- 2001: 1 billion pages (Surface Web), 550 billion pages (Deep Web; 7.5PB) http://www.brightplanet.com/technology/deepweb.asp
- 2002: 2 billion pages; English: 56.4%, German: 7.7%, French: 5.6%, Japanese: 4.9% http://www.netz-tipp.de/languages.html
- 2003: 167TB (Surface Web), 92PB (Deep Web) http://www2.sims.berkeley.edu/research/projects/how-much-info-2003
- 2005, January: 11.5 billion searchable web pages in 75 languages http://www.cs.uiowa.edu/~asignori/web-size/
Survey Report of Japanese Web Sites (by NDL, 2005)
Web data
- HTML files: about 44 million files
- Picture files: about 55 million files
- Estimated total number of files: 450 million files
- Estimated total volume of data: 18.4TB
jp domain: 182,093 hosts
- go.jp: 2,336 hosts (1.28%), 4.4% of files, 8.5% of volume
http://www.ndl.go.jp/jp/aboutus/bulkresearch2005summary.html
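The survey figures can be cross-checked with simple arithmetic. The sketch below derives the average file size and the go.jp share from the numbers on this slide; the decimal definition of a terabyte is an assumption, since the survey summary quoted here does not say which convention it uses.

```python
# Assumption: 1TB = 10**12 bytes (decimal convention).
TB = 10**12

total_files = 450_000_000   # estimated total number of files
total_tb = 18.4             # estimated total volume of data, in TB
jp_hosts = 182_093
gojp_hosts = 2_336

# Average size of one file, in kilobytes.
avg_file_kb = total_tb * TB / total_files / 1000

# go.jp share of hosts, files and volume (4.4% and 8.5% are from the slide).
gojp_host_share = gojp_hosts / jp_hosts * 100
gojp_files = total_files * 0.044
gojp_tb = total_tb * 0.085

print(f"average file size: {avg_file_kb:.1f}KB")
print(f"go.jp hosts: {gojp_host_share:.2f}% of jp hosts")
print(f"go.jp files: {gojp_files / 1e6:.1f} million")
print(f"go.jp volume: {gojp_tb:.2f}TB")
```

The results (about 19.8 million files and 1.56TB for go.jp) agree with the rounded values of 20 million pages and 1.6TB quoted for go.jp elsewhere in the talk.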
Digital Archives
Organization | From | Characteristics
Internet Archive | 1996 | Wayback Machine (Fair Use)
Austria, National Lib. | 1996/6 | Legislation
Sweden, Royal Lib. | 1996/9 | Legislation
Denmark, Royal Lib. | 1997/6 | Legislation
Australia, National Lib. | 1997/6 | Discussion
France, National Lib. | 1999 | Discussion
USA, Lib. of Congress | 2000 | NDIIPP
Finland, National Lib. | 2000/8 | Proposal of Legislation
Britain, Lib. | 2001/5 | 2003, Legislation (non-print material)
China, Lib. | 2003/1 | WICP; Discussing Legislation for Networked Electronic Publishing (2003/5)
Korea, National Lib. | 2006/2 | OASIS; Discussing Legislation; National Digital Library is under construction to open in 2008
Germany, National Lib. | 2006/6 | Legislation
Japan, National Diet Lib. | 2002/6 | Middle term planning (2004)
Towards Better Digital Archives
Preserve fragile born-digital contents
- Academic/scientific/artistic/cultural resources
Archive of digital information
- Technologies of long-term preservation
- Legislation of long-term preservation
- Organization of long-term preservation
Organization
- National libraries for digital preservation projects
- IIPC (International Internet Preservation Consortium)
National Archive Library for preserving Digital Information
Organization (Mandatory, Belief)
- National Diet Library
- National Archives of Japan
- Public/Private Libraries
- NII
- Government
Legislation (Law, Consensus, Commons)
- Law of National Diet Library, Law of Libraries, Law of National Archive, Law of Museums etc.
- Intellectual Properties: Copyright Law, Copyleft/Creative Commons
Technologies (Architecture, Mission-driven)
- Internet Technologies
- Natural Language: CJK including Vietnamese
- Information Retrieval
- Database Technologies
- Archive Technologies
Various Technical Problems
Programs for crawling contents from surface and deep webs provided by dynamic web services
- Emulation and migration of dynamic content
- Heritrix
Collaboration and optimization of distributed systems
- Preserving monotonically increasing digital contents
- Crawling, storage, and information retrieval with a time-line
- Wera (Web ARchive Access), OAIS, DSpace etc.
Metadata formats
- URI, RDF, MODS (Metadata Object Description Schema)
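One simple way to divide crawling work among cooperating crawlers is to assign every host to exactly one crawler by hashing the host name. The sketch below illustrates that idea only; the function names are hypothetical, and this is an assumption for illustration, not the scheme used by Heritrix or by the NDL.

```python
import hashlib
from urllib.parse import urlsplit


def crawler_for(url: str, num_crawlers: int) -> int:
    """Return the index of the crawler responsible for the URL's host.

    Hashing the host (not the full URL) keeps all pages of one site on
    the same crawler, so per-site politeness can be enforced locally.
    """
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_crawlers


def partition(urls, num_crawlers):
    """Split a list of URLs into one queue per crawler."""
    queues = [[] for _ in range(num_crawlers)]
    for url in urls:
        queues[crawler_for(url, num_crawlers)].append(url)
    return queues


seeds = [
    "http://www.ndl.go.jp/jp/standards/da/index.html",
    "http://www.ndl.go.jp/jp/aboutus/bulkresearch2005summary.html",
    "http://www.loc.gov/standards/",
]
queues = partition(seeds, num_crawlers=4)
```

Because the assignment depends only on the host name, both www.ndl.go.jp seeds land in the same queue regardless of their paths.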
Hidden webs and archiving
How do we archive contents stored in hidden webs?
[Figure labels: Web Servers; Wrapper; Mediator; Agents (KQML); Search; Archiving Robots; Web Archiving Systems (metadata, site summaries, frequent navigational patterns, representative web contents); Metadata Repository; Web summaries; knowledge and rules derived from association rules and web mining; advanced techniques]
Growth of Storage Market
Trend of storage volume (IDC Japan): 10 times in 2010
- 2010: storage volume of 1370PB (10 times the volume in 2005)
- Growth rate: 56.9%/year
Storage market in Japan (JEITA; units: ¥100M, TB)
- Price per TB: ¥13.07M/TB, ¥8.42M/TB, ¥5.02M/TB, ¥2.73M/TB
- Vendors shown: Dell, Hitachi, others
Storage media
- Next DVD: 25-30GB
- 2010: Holographic Disc: 200GB-1TB
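The two growth figures on this slide can be related by compounding the annual rate over the five years from 2005 to 2010. The sketch below does that; the 2005 volume of 137PB is derived here from "1370PB is 10 times the 2005 volume" and is not quoted directly in the talk.

```python
def compound_growth(rate_per_year: float, years: int) -> float:
    """Growth factor after compounding an annual rate for a number of years."""
    return (1 + rate_per_year) ** years


# 56.9%/year over 2005 -> 2010.
factor = compound_growth(0.569, 5)

# Derived assumption: the 2005 volume is one tenth of the 2010 forecast.
volume_2005_pb = 1370 / 10
projected_2010_pb = volume_2005_pb * factor

print(f"growth factor over 5 years: {factor:.2f}")
print(f"projected 2010 volume: {projected_2010_pb:.0f}PB")
```

The compounded factor comes out at roughly 9.5, so "10 times in 2010" is a rounded statement of the 56.9%/year growth rate.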
Architecture of hierarchical storage
- First-level storage (plain files, full-text search)
- Second-level storage (compressed files, partial indexing)
- Third-level storage (archiving multiple files with compression, low-cost devices)
- Cache storage (prefetch)
- Operational database of the archiving system (log files of web robots, search queries, navigational patterns)
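A placement policy for the three storage levels could be driven by importance and access frequency, as the talk suggests for preservation strategies. The sketch below is one possible rule; the class names, the `important` flag and the thresholds of 100 and 5 accesses per year are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    FIRST = "plain files, full-text search"
    SECOND = "compressed files, partial indexing"
    THIRD = "multi-file compressed archives, low-cost devices"


@dataclass
class ArchivedItem:
    url: str
    accesses_last_year: int   # e.g. counted from the operational database logs
    important: bool = False   # hypothetical curator-assigned flag


def choose_tier(item: ArchivedItem, hot: int = 100, warm: int = 5) -> Tier:
    """Pick a storage level; the thresholds are illustrative assumptions."""
    if item.important or item.accesses_last_year >= hot:
        return Tier.FIRST
    if item.accesses_last_year >= warm:
        return Tier.SECOND
    return Tier.THIRD


popular = ArchivedItem("http://example.jp/top.html", accesses_last_year=250)
occasional = ArchivedItem("http://example.jp/report.pdf", accesses_last_year=12)
dormant = ArchivedItem("http://example.jp/old/1998.html", accesses_last_year=0)
```

In such a design the operational database would supply the access counts, and items would be re-evaluated periodically so that they migrate between levels as usage changes.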
Guidelines of Metadata http://www.loc.gov/standards/
Various Formats and Standards
Resource Description Formats
- MARC 21 formats: representation and communication of descriptive metadata about information items
- MARCXML: MARC 21 data in an XML structure
- MODS (Metadata Object Description Schema): XML markup for selected metadata from existing MARC 21 records as well as original resource description
- MADS (Metadata Authority Description Schema): XML markup for selected authority data from MARC 21 records as well as original authority data
- EAD (Encoded Archival Description): XML markup designed for encoding finding aids
Digital Library Standards
- METS (Metadata Encoding & Transmission Standard): structure for encoding descriptive, administrative, and structural metadata (www.loc.gov/mets)
- MIX (NISO Metadata for Images in XML): XML schema for encoding technical data elements required to manage digital image collections
- PREMIS (Preservation Metadata): a data dictionary and supporting XML schemas for core preservation metadata needed to support the long-term preservation of digital materials
Options (metadata) of OPAC and WARP
OPAC | WARP
Title | Title
Authors/Editors | Authors, Editors
Location | Start URL
Year | Duration
Category | Category
Category No. (NDC, NDLC, LCC, DDC, UDC, GPO) | NDC
Standard No. (ISBN, ISSN, CODEN, UTM, ISRN, ISMN etc.) | ISSN+ISBN
Book ID (JAPANMARC, USMARC, UKMARC, OCLC etc.) |
Management No. | Meta ID
Codes (Language, Original Language, Gov., Univ. etc.) |
Japanese/Western Books, Digital Contents, Music/Video, Ashihara Collection etc. | Collections, NDL Resource
NDL-DA (NDL-Digital Archive) System
- Based on the OAIS reference model
- An Information Package consists of content information and metadata
- Organizing units: bibliography, volume, number, article; web site, web pages
http://www.ndl.go.jp/jp/standards/da/index.html
OAIS (Open Archival Information System) http://www.rlg.org/en/pdfs/rlgnews/news56.pdf
- Submission Information Package (SIP)
- Archival Information Package (AIP)
- Dissemination Information Package (DIP)
- Descriptive metadata is stored separately
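The flow between the three OAIS package types can be sketched as a small data model: a producer submits a SIP, the archive ingests it into an AIP, and a consumer receives a DIP. The class and method names below are hypothetical and the model is heavily simplified; it is a sketch of the idea, not the NDL-DA implementation.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class SIP:
    """Submission Information Package handed over by a producer."""
    content: bytes
    descriptive: dict


@dataclass
class AIP:
    """Archival Information Package kept inside the archive."""
    package_id: str
    content: bytes
    preservation_history: list = field(default_factory=list)


@dataclass(frozen=True)
class DIP:
    """Dissemination Information Package delivered to a consumer."""
    package_id: str
    content: bytes
    descriptive: dict


class Archive:
    def __init__(self):
        self._aips = {}
        # Descriptive metadata is stored separately from the AIPs.
        self._descriptive = {}

    def ingest(self, sip: SIP) -> str:
        package_id = hashlib.sha256(sip.content).hexdigest()[:16]
        event = {"event": "ingest", "at": datetime.now(timezone.utc).isoformat()}
        self._aips[package_id] = AIP(package_id, sip.content, [event])
        self._descriptive[package_id] = dict(sip.descriptive)
        return package_id

    def disseminate(self, package_id: str) -> DIP:
        aip = self._aips[package_id]
        return DIP(package_id, aip.content, dict(self._descriptive[package_id]))


archive = Archive()
pid = archive.ingest(SIP(b"<html>...</html>", {"title": "Sample page"}))
dip = archive.disseminate(pid)
```

Keeping the descriptive metadata in its own store mirrors the note on the slide and lets search indexes be rebuilt without touching the archived content.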
NDL: Metadata
Information package metadata: preserving contents and associated metadata
- Descriptive metadata: bibliography (title, publisher, volume, number etc.)
- Technical metadata: CPU, hardware, operating system, software etc.
- Preservation metadata: long-term preservation (ingest/migration history etc.)
- Rights metadata: permission, creator, authority, audience etc.
- Control metadata: other data for preservation/utilization/management
NDL: Metadata
- Information package metadata: METS 1.6 (Metadata Encoding and Transmission Standard)
- Descriptive metadata: MODS 3.2 (Metadata Object Description Schema) and the NDL-DA Metadata Scheme. MODS is a derivative of MARC21 and is not very complex.
- Technical metadata: PREMIS-based scheme
- Preservation metadata: PREMIS-based scheme
- Rights metadata: PREMIS-based scheme
  - PREMIS (PREservation Metadata: Implementation Strategies)
  - View Path is the Preservation Layer Model in DIAS (Digital Information Archiving System, Netherlands)
- Control metadata: NDL-DA Metadata Scheme
Sample: Attribute Values
typeOfResource (based on MARC21)
- text; cartographic; notated music; sound recording; sound recording-musical; sound recording-nonmusical; still image; moving image; three dimensional object; software, multimedia; mixed material
digitalOrigin (based on MODS)
- born digital; reformatted digital; digitized microfilm; digitized other analog
Japanese Kana: script or transliteration
- kokuritsu kokkai toshokan
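The attribute values above can be used to build and validate a small descriptive record. The sketch below produces a minimal MODS fragment with Python's standard library; the `build_mods` helper is hypothetical, only a handful of MODS elements are used, and the result is plain MODS rather than the NDL-DA Metadata Scheme.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"

# Controlled values as listed on the slide.
TYPE_OF_RESOURCE = {
    "text", "cartographic", "notated music", "sound recording",
    "sound recording-musical", "sound recording-nonmusical", "still image",
    "moving image", "three dimensional object", "software, multimedia",
    "mixed material",
}
DIGITAL_ORIGIN = {
    "born digital", "reformatted digital",
    "digitized microfilm", "digitized other analog",
}


def build_mods(title: str, type_of_resource: str, digital_origin: str) -> ET.Element:
    """Build a minimal MODS record, rejecting values outside the lists above."""
    if type_of_resource not in TYPE_OF_RESOURCE:
        raise ValueError(f"unknown typeOfResource: {type_of_resource!r}")
    if digital_origin not in DIGITAL_ORIGIN:
        raise ValueError(f"unknown digitalOrigin: {digital_origin!r}")

    ET.register_namespace("mods", MODS_NS)
    mods = ET.Element(f"{{{MODS_NS}}}mods")
    title_info = ET.SubElement(mods, f"{{{MODS_NS}}}titleInfo")
    ET.SubElement(title_info, f"{{{MODS_NS}}}title").text = title
    ET.SubElement(mods, f"{{{MODS_NS}}}typeOfResource").text = type_of_resource
    # In MODS, digitalOrigin is a child of physicalDescription.
    physical = ET.SubElement(mods, f"{{{MODS_NS}}}physicalDescription")
    ET.SubElement(physical, f"{{{MODS_NS}}}digitalOrigin").text = digital_origin
    return mods


record = build_mods("kokuritsu kokkai toshokan", "text", "born digital")
print(ET.tostring(record, encoding="unicode"))
```

Validating against the controlled lists at build time catches values such as "web page" that are not valid typeOfResource values, before a record enters the archive.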
Conclusion
Web archives are one of the dominant information infrastructures in the digital information society.
- Technical problems: distributed crawling, long-term huge storage, advanced IR
- Social problems: intellectual properties (copyright law, Creative Commons)
Huge volume and long-term preservation
- Distributed crawling programs: surface and hidden webs, complex web services
- Huge storage systems using hierarchical architecture: storage media, archiving formats, compression methods and rates
- Retrieving mechanism: navigational pattern mining in web archives
- Preserving strategies by importance and access frequency
- Effective emulation and migration of dynamic contents
Discussion
Digital Archives
- Infrastructure of digital contents
Problems of Digital Archives
- Technology: collaboration on standardization
- Legislation: consensus among stakeholders
- Organization: store/preservation/utilization
Towards Better Digital Archives
- Collaboration for integrated digital archives
- Library, national archive, museum, university, laboratory, company etc.