OUTREACHING WITH PRINT RESOURCES IN THE DIGITAL AGE Jidong Yang University of Michigan
Problems with CJK encodings Chinese encodings have expanded slowly: from GB 2312 (about 6,700 characters) and Big 5 (about 13,000 characters) to GBK (about 22,000 characters), GB 18030 (about 27,000 characters), GB 18030-2005 (more than 70,000 characters), and Unicode Version 5 (similar in coverage to GB 18030-2005). Not every computer supports all of these characters, and many existing databases are built on the earlier encodings. The mainstream Japanese encodings, JIS and EUC, each have fewer than 7,000 kanji.
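The coverage gap between encodings can be checked directly. The following is a minimal sketch, not part of the original presentation, that uses Python's built-in codecs to test which legacy encodings can represent a given character. The function name `supported_encodings` and the three sample characters are illustrative assumptions, and Python's codec tables may differ in detail from the encoding versions a particular database was built on.

```python
# Codec names as registered in Python's standard library.
ENCODINGS = ["gb2312", "big5", "gbk", "gb18030", "euc_jp"]


def supported_encodings(char, encodings=ENCODINGS):
    """Return the encodings (hypothetical helper) that can represent `char`."""
    ok = []
    for name in encodings:
        try:
            char.encode(name)  # raises UnicodeEncodeError if the character is missing
        except UnicodeEncodeError:
            continue
        ok.append(name)
    return ok


if __name__ == "__main__":
    # A common character, a traditional-form character, and a rare
    # character from CJK Unified Ideographs Extension B (U+20000).
    for ch in ["中", "龍", "\U00020000"]:
        print(f"U+{ord(ch):04X}", supported_encodings(ch))
```

A database built on one of the smaller encodings cannot store a character outside its repertoire, so such a character is typically replaced by a substitute or dropped at data-entry time and can no longer be searched.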
The issue of OCR accuracy Even for contemporary Chinese and Japanese publications in good condition, the best OCR software can hardly achieve an accuracy rate better than 95%. For pre-modern CJK texts, OCR accuracy drops to 30-40% or even lower. Many database companies keep their OCR accuracy rates secret.
The early stage of digital scholarship New research methods and tools suited to digital resources are still rare and need to be invented. A great number of research tools in print format still retain their value, at least for now.
Databases vs. print indexes How can one find information about Kumārajīva in the Gaoseng zhuan? Searching for Jiumoluoshi is not enough. One must also try Jiumoluoqipo, Shi, Shigong, Shishi, Tongshou, and Luoshi. All of these can be found in Ryō kōsō den sakuin, compiled by Makita Tairyō. Databases are not necessarily better than print indexes.
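The point about name variants can be illustrated with a small search sketch. It is not from the original presentation: the function `find_mentions` is hypothetical, the sample passage is an invented placeholder rather than a quotation from the Gaoseng zhuan, and romanized forms are used only for readability (a real full-text database would search the Chinese text). The list of variants is the one given on the slide.

```python
import re

# Name variants for Kumārajīva listed on the slide (romanized).
ALIASES = ["Jiumoluoshi", "Jiumoluoqipo", "Shi", "Shigong",
           "Shishi", "Tongshou", "Luoshi"]


def find_mentions(text, aliases):
    """Count whole-word occurrences of each alias in `text` (hypothetical helper)."""
    # Try longer names first so that "Shigong" is not counted as "Shi".
    ordered = sorted(aliases, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, ordered)) + r")\b")
    counts = {}
    for match in pattern.finditer(text):
        counts[match.group(0)] = counts.get(match.group(0), 0) + 1
    return counts


# Invented placeholder text, for illustration only.
sample = ("Jiumoluoshi arrived in Chang'an. "
          "Later passages call him Shi or Tongshou.")

if __name__ == "__main__":
    print(find_mentions(sample, ["Jiumoluoshi"]))  # single-name query
    print(find_mentions(sample, ALIASES))          # query expanded with variants
```

A query on the single name misses the later references in the placeholder passage, while the expanded query finds them. The expansion is only possible once someone has compiled the list of variants, which is the work a print index such as Makita's has already done.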
Conclusion The computer still cannot match the book in presenting the full range of East Asian languages and cultures. Print resources are still necessary for most serious research on East Asia. It is our job to make the value of our print collections known to our patrons.