National Library of China

Web Information Preservation at National Library of China
A pilot project on preserving web information resources
Chinese-European Workshop on Digital Preservation, Beijing, July 13-17
Wang Zhigeng (王志庚), National Library of China
Good afternoon, ladies and gentlemen. My name is Wang Zhigeng. I am currently the team leader of a web archiving project at the National Library of China (NLC). It is my pleasure to share what NLC has done, and is doing now, in the long-term preservation of digital information. Your suggestions and comments are warmly welcome.

Outline
Brief introduction to NLC
WICP (Web Information Collection and Preservation) Project
Other efforts
My presentation consists of three parts. First, I would like to take five minutes to introduce the National Library of China itself before turning to its digital preservation efforts. Then I will focus on the WICP project, the web archiving project at NLC: its objectives, technical solutions, working models, workflow and related issues. Finally, I will introduce some other efforts NLC is making to preserve digital information over the long term.

Brief introduction to NLC
Established in 1909, a history of 95 years
Opened to the public on August 27, 1912
Started accepting legal deposit copies in 1916
A collection of 24.11 million volumes and items; the largest library in Asia
NLC was established in 1909 and has a history of 95 years. In 1912 the Library opened to the public. In 1916 it began accepting legal deposit copies of national publications. Today the total collection exceeds 24 million volumes, and NLC has become the largest library in Asia in terms of collection and floor area.

Main building / Branch library / New building
NLC is composed of three parts: the main library, the branch library, and the new building, which is to be completed in 2007. The total floor area is about 240,000 square meters. This is the library building completed in 1987; it is today’s main library. This is the Capital Library, completed in 1909; it is now the branch library. And this photo shows the new library building.

Key functions of NLC
A deposit library
The National Bibliography
Reference, reading and loan services
A key function of NLC is to act as a deposit library for publications. The aim of a deposit library is to collect publications, preserve them, and provide permanent access to this information for research, education, or any other purpose in society. In addition, NLC houses special collections of rare and ancient books, atlases, rubbings and manuscripts. It edits and maintains the National Bibliography System. It also provides reference and lending services and offers seating for 3,300 patrons. You can learn more about the Library on its website.

Preservation of printed documents
Adequate storage conditions: first-class stacks for rare books and preservation copies
Microfilming and digitizing of the collection
The National Library of China has always attached great importance to the preservation and protection of all kinds of documents. To preserve printed publications permanently, NLC has established first-class storage rooms for rare books and for copies that must be kept for a long time. These rooms provide proper temperature, humidity and illumination for long-term preservation. Advanced technologies are used in these rooms to protect against insects, fire, water and theft. In addition, NLC has made microfilmed and digitized copies of some important printed documents.

Preservation of digital information
Not easy; different from the preservation of printed materials
Legal and institutional issues: legal deposit policies, intellectual property rights
Technical issues: preservation of digital objects, of the digital environment, and of preservation metadata
Organizational, social and economic issues
The National Library of China also attaches importance to the preservation of electronic publications, audiovisual materials and digital information, including born-digital resources and digital reproductions. Preserving digital information for a long time is not an easy job, and it is quite different from preserving printed materials. Many issues must be considered, such as legal deposit policies, intellectual property rights, the digital preservation environment, technical issues like metadata, and organizational, social and economic factors. The National Library of China plays an active role in solving these problems by operating an experimental web archiving project. Based on the results of this project, NLC will be able to formulate policies and strategies for the long-term preservation of digital information.

Our understanding of web archiving
It is our understanding that web information resources have become a major part of Chinese civilization and digital heritage and should be properly preserved and protected. They are of strategic importance for NLC’s collection development and public services. NLC should collect and preserve web information resources comprehensively, as it has done for printed materials.

Why preserve web information?
More and more information is published on the web
Web information is volatile: the average life span of a web page is 75 days
The web is a new dimension of social culture and a modern cultural heritage
Many early web pages have already disappeared
Why should we preserve web information? More and more information is appearing on the web. According to a report by CNNIC (China Internet Network Information Center), as of December 2003 there were 311,000,000 web pages in Chinese, and the total number of Internet users was 79.5 million. Unquestionably, the Internet has become one of the most important ways for people to get information. According to a report by Alexa Internet, the average life span of a web page is 75 days. This rate of decay means that, without collection and preservation, invaluable scholarly, cultural and scientific resources may be unavailable to future generations. If we cannot preserve today’s web pages in time, we will lose them forever. The web has become a new dimension of social culture and part of today’s cultural heritage. Our lives are flooded with net language. Many young people look for love on the web. They read stories, play games and shop online, and if they wish, they can also complete their college courses through the web. The Internet has become an important channel for spreading contemporary culture. However, many early web pages have disappeared. Better preservation and protection of current web information is therefore an urgent task for our generation.

WICP model (diagram)
To preserve web pages, NLC launched the Web Information Collection and Preservation (WICP) Project in 2003. As we all know, web-based resources can be divided into two categories (pointing): the surface web and the deep web. We use a different capture strategy for each. A robot captures surface web pages, while a legal deposit policy is used to obtain deep web pages. WICP takes a selective collection approach. Both websites and web pages are collected and preserved: websites are stored in the Mirror Archive, and web pages are stored in the Subject Archive.

Mirror Archive
First of all, I would like to introduce the working process of the Mirror Archive. Let me show you an example. As you can see, this is the website of the Beijing Municipal Government. On the homepage you will find many columns, such as economy, travel, etc. Each column is further divided into sub-columns, such as traffic, eating and shopping. Each sub-column consists of many individual items, such as the Peking roast duck that we ate yesterday. When we archive this website, all of these elements (columns, sub-columns and individual items) are captured in one package, with the structure and relevant links kept intact. The capture frequency is, of course, controlled, and a data check mechanism helps us ensure that the captured objects are complete and correct. Based on this correct and complete information, we can then catalog them as bibliographic records. We use a robot (pointing) to collect a website starting from its homepage. The original structure and links of the captured web pages are kept. An identifier is then assigned to each downloaded website package for preservation purposes. Because websites are updated frequently, the same website is captured at different times, producing many versions of one object (pointing). The captured websites are indexed using the core elements of Dublin Core. Finally, all metadata are entered into the National Bibliography (pointing).

Mirror Archive: workflow
1. Survey of target website
2. Setting capturing conditions
3. Starting capture
4. Cataloging
5. Quality control
6. Downloaded website registry
7. Providing service
This slide shows the complete workflow of the Mirror Archive. Step 1, survey of the target website: we decide whether a website will be archived based on three aspects, namely content, copyright and technical feasibility. Step 2, setting capturing conditions, such as the capture frequency of the robot. Step 3, the robot is activated to start the capture. Step 4, cataloging, involves recording metadata items such as website name, author, publisher, launch date, classification, subject words, resource type and URL of the homepage. Step 5, quality control: we check the validity and integrity of the content and its presentation, and whether the pages are accessible. Step 6, after a downloaded website passes quality control, it is registered in the metadata system. Step 7, providing service: our ultimate goal in archiving these websites is to help our users get the web information they need. WICP is currently accessible only over the LAN, and we hope to open it to the general public in the near future.
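As an illustration of the cataloging and registry steps above, here is a minimal sketch in Python of how one website record and its captured versions might be modeled. The field names follow Dublin Core element names, but the mapping from the metadata items listed in the talk, the class names, the identifier format and the example URL are all assumptions made for illustration; they are not taken from the WICP system.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class Snapshot:
    """One captured version of a website package."""
    package_id: str   # hypothetical identifier format, see add_snapshot
    captured_on: date


@dataclass
class MirrorRecord:
    """Catalog record for one website, using Dublin Core element names."""
    title: str          # website name
    creator: str        # author
    publisher: str
    date: str           # launch date
    subject: List[str]  # classification and subject words
    type: str           # resource type
    identifier: str     # URL of the homepage
    snapshots: List[Snapshot] = field(default_factory=list)

    def add_snapshot(self, captured_on: date) -> Snapshot:
        # Assumed scheme: homepage URL plus capture date identifies a package.
        package_id = f"{self.identifier}@{captured_on.isoformat()}"
        snap = Snapshot(package_id, captured_on)
        self.snapshots.append(snap)
        # Keep versions in chronological order.
        self.snapshots.sort(key=lambda s: s.captured_on)
        return snap


record = MirrorRecord(
    title="Example site",
    creator="Example Organization",
    publisher="Example Organization",
    date="2001-01-01",
    subject=["government information"],
    type="website",
    identifier="http://example.org/",
)
record.add_snapshot(date(2004, 6, 20))
record.add_snapshot(date(2004, 3, 1))
for snap in record.snapshots:
    print(snap.package_id)
```

Keeping the versions as a list under a single record mirrors the idea that repeated captures of one website form many versions of the same object.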

Mirror Archive: collection statistics
Government information (.gov): 57 websites
E-journals: 34 websites
Chinese studies: 25 websites
Here are the statistics for the Mirror Archive collection as of June 20, 2004. Government information: 57 government (.gov) websites have been archived, including those of central state organs, ministries and commissions of the State Council, provinces, municipalities, autonomous regions and major cities. E-journals: 34 websites offering free full-text access to magazines and newspapers have been collected. Chinese studies: 25 domestic and foreign websites have been included in the Mirror Archive.

Subject Archive
I have talked at length about the Mirror Archive; now I will turn to the other type of archive in WICP, the Subject Archive. As you know, there are many web pages on the Internet concerned with the same subject, for example major events such as SARS and the Beijing Olympic Games (pointing). They come from different websites, portals, search engine results and chat rooms. Here I use one color for each subject: the yellow ones (pointing) represent SARS, and the white ones (pointing) represent the Beijing Olympic Games. The robot collects the web pages on the same subject and preserves them in the Subject Archive. It can also automatically extract metadata from the source file of each web page, index the page, and put the metadata into the database.

Subject Archive: workflow
1. Selection of subject
2. Survey
3. Setting capturing conditions
4. Starting capture
5. Metadata mining
6. Object downloading (web page snapshot)
7. Data storage
8. Quality control
9. Providing service
This slide shows the workflow of the Subject Archive. Step 1, selection of subject: based on an evaluation of the importance, influence and duration of an event, we decide whether it is necessary to establish a new subject. Step 2, source survey: we survey the content of portals, search engines and chat rooms and the technical feasibility of downloading from them. Step 3: before activating the robot, our staff set its capturing conditions, which include the depth, extent and frequency of capture. Step 4: with all this work completed, the robot is ready to start the capture. Step 5, metadata mining: as I mentioned on the previous slide, the robot automatically extracts metadata elements, including title, author, publishing date and time, and original URL. It also performs automatic classification and indexing, abstracts the content, assigns keywords and creates a unique identifier for each web page. Step 6, object downloading: each web page is downloaded as an object. Step 7, storage: metadata and downloaded pages are stored in the metadata database and the object database respectively. Step 8, quality control: as in the Mirror Archive, a quality control mechanism is applied, here focused on checking the validity of the metadata of captured web pages. Step 9, providing service: like the Mirror Archive, the Subject Archive can at present be accessed only over the LAN.
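The metadata mining step can be sketched roughly as follows. This is only an illustration using Python’s standard library, not the WICP robot: the function name `mine_metadata`, the choice of `<meta>` tag names, and the use of a truncated SHA-1 hash of the URL as the unique identifier are all assumptions.

```python
import hashlib
from html.parser import HTMLParser

# Assumed <meta name="..."> fields to harvest; real pages vary widely.
WANTED = {"author", "date", "keywords"}


class MetaExtractor(HTMLParser):
    """Collects the <title> text and selected <meta> tags from a page."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "meta":
            attributes = dict(attrs)
            name = (attributes.get("name") or "").lower()
            content = attributes.get("content")
            if name in WANTED and content:
                self.meta[name] = content.strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def mine_metadata(url, html):
    """Return a metadata record for one downloaded web page."""
    parser = MetaExtractor()
    parser.feed(html)
    keywords = parser.meta.get("keywords", "")
    return {
        # Hypothetical identifier scheme: first 16 hex digits of SHA-1(url).
        "identifier": hashlib.sha1(url.encode("utf-8")).hexdigest()[:16],
        "url": url,
        "title": parser.title.strip(),
        "author": parser.meta.get("author", ""),
        "date": parser.meta.get("date", ""),
        "keywords": [k.strip() for k in keywords.split(",") if k.strip()],
    }


page = (
    "<html><head><title> SARS update </title>"
    '<meta name="Author" content="Example Desk">'
    '<meta name="keywords" content="SARS, health">'
    "</head><body>text</body></html>"
)
print(mine_metadata("http://example.org/news/1", page))
```

Automatic classification and abstracting, which the talk also mentions, are not covered by this sketch; they would need separate text-analysis components.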

Subject Archive: collection statistics
As of June 20, 2004, we have established five subject archives:
2008 Beijing Olympic Games: 300,000 pages (ongoing)
SARS: 500,000 pages (finished)
The manned space flight project: 220,000 pages (finished)
Media reports about NLC: 13,000 pages (ongoing)
Library and information science: 30,000 pages (ongoing)

Some issues to be addressed
Web robot technology
Mass storage technology
...
We have collected a great deal of web information since we launched the project, and some technical problems remain to be solved. For example, we need to improve automatic classification, indexing and abstracting, as well as automatic duplicate removal. In addition, finding a better strategy for data compression, decompression and transfer is a great challenge.
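To make the duplicate removal problem concrete, here is a minimal sketch of one common approach: fingerprinting the normalized text of each page and keeping only the first page per fingerprint. The talk does not say which technique WICP uses, so this is an assumption, and it only catches exact duplicates after whitespace and case normalization, not near-duplicates.

```python
import hashlib
import re


def fingerprint(text: str) -> str:
    """Hash of the page text with whitespace collapsed and case folded."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def drop_duplicates(pages):
    """pages: iterable of (url, text). Keeps the first page per fingerprint."""
    seen = set()
    kept = []
    for url, text in pages:
        fp = fingerprint(text)
        if fp not in seen:
            seen.add(fp)
            kept.append((url, text))
    return kept


captured = [
    ("http://example.org/a", "Hello  World"),
    ("http://example.org/b", "hello world\n"),  # same text, different spacing
    ("http://example.org/c", "Other content"),
]
for url, _ in drop_duplicates(captured):
    print(url)
```

Pages that differ only in navigation bars or timestamps would still count as distinct here; handling those would require a similarity measure rather than an exact hash.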

Other efforts
Suggestions for policy-making: a proposal that web information resources be written into the Chinese Library Law as objects of legal deposit
Technological attempts at digital information preservation, such as reformatting and migration
I have just talked about the WICP project. In addition to preserving web resources, NLC is actively researching and testing the long-term preservation of e-books, e-journals, digital audio and digital video. NLC is also working on the Submission and Preservation System for doctoral dissertations, which aims to collect and preserve doctoral dissertations nationwide. PhD students can submit their dissertations through the web in .doc or .pdf format. We have also offered suggestions for policy-making, with a view to creating a good policy environment for digital information preservation in China. In 2003 the National Library of China submitted a proposal to the drafting committee of the Chinese Library Law, advocating that web-based resources be treated as legal deposit material in the Chinese Library Law (draft). Such a legal stipulation would ensure NLC’s right to download, copy and preserve web information and to serve it to the public. At present, drafting of the law is still in progress. On the technical side, NLC has made some attempts at digital information preservation: it has reformatted and migrated some CDs, VCDs and DVDs so that their content can be preserved for the long term. As of December 2003, 387,418 songs had been reformatted as MP3, totaling 2.2 TB, and 10,598 hours of video had been reformatted as MPEG-4, totaling 6.7 TB. In the near future, NLC will test technology preservation and emulation.
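As a rough sanity check on the reformatting figures above, the short calculation below derives the average size per song and the average video bit rate. It assumes decimal units (1 TB = 10^12 bytes), which the talk does not specify, so the results are indicative only.

```python
TB = 10**12  # assumption: decimal terabytes
MB = 10**6


def average_bytes(total_tb: float, count: int) -> float:
    """Average size in bytes per item (or per hour of video)."""
    return total_tb * TB / count


song_bytes = average_bytes(2.2, 387_418)            # MP3 audio
video_bytes_per_hour = average_bytes(6.7, 10_598)   # MPEG-4 video
video_bits_per_second = video_bytes_per_hour * 8 / 3600

print(f"{song_bytes / MB:.2f} MB per song")
print(f"{video_bytes_per_hour / MB:.0f} MB per hour of video")
print(f"{video_bits_per_second / 1e6:.2f} Mbit/s average video bit rate")
```

Under that assumption the figures work out to roughly 5.7 MB per song and about 1.4 Mbit/s for video.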

Cooperation
There is not yet a proven technology or strategy for preserving digital resources; we are ready to work with all colleagues at home and abroad to protect humanity’s shared digital cultural heritage.
Despite the challenges in finding an effective way to preserve digital resources, we are ready to make joint efforts with all colleagues in the library community to preserve digital information, which is an indispensable part of our cultural record and an invaluable heritage for future generations. Thank you!

Thank you!
