Building digital libraries in Indian languages: case studies with Hindi and Kannada B.S. Shivaram Trainee (2001-2002) National Center for Science Information.


Building digital libraries in Indian languages: case studies with Hindi and Kannada B.S. Shivaram Trainee (2001-2002) National Center for Science Information

Table of Contents Introduction to Multilingual Digital Libraries Different Character Sets and Encodings Statement of the problem Objectives Need for the project Methodology Implementation System description Observations Limitations Conclusion Future developments

Multilingual Digital Library Library Digital library Monolingual digital library Multilingual digital library

Definition of MDL According to Ana M. B. Pavani “A multilingual digital library is a digital library that has all functions implemented simultaneously in as many languages as desired and whose search & retrieve functions are language independent”.

Terms related to multilingualism i18n (internationalization) Localization Multilingual digital library Multilingual documents ( ಕನ್ನಡ, हिन्दी, મં।ગેલ ) Cross-language Retrieval

Issues of MDL Multiple language recognition, manipulation and display. Multilingual or cross-language search and retrieval

Character set and Encodings Charset: a collection of characters, as a human would understand them. Ex: ಅ, ಆ, ಇ, ಈ, and so on belong to the charset of Kannada; अ, आ, इ, ई, and so on belong to the charset of Hindi; A, B, C, D, and so on belong to the Latin charset used for English. Character encoding: a way of storing characters on a computer as bits.
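
The charset/encoding distinction above can be illustrated with a minimal Python sketch. The sample characters come from the slide; the helper name `encoded_sizes` and the choice of the three UTF encodings are assumptions made only for illustration.

```python
# One character from each charset mentioned on the slide.
SAMPLES = {"Latin": "A", "Hindi": "अ", "Kannada": "ಅ"}

def encoded_sizes(ch):
    """Return the number of bytes needed to store ch in three Unicode encodings."""
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-be", "utf-32-be")}

for script, ch in SAMPLES.items():
    # The character is the same abstract unit; only its stored bytes differ.
    print(script, ch, ch.encode("utf-8").hex(), encoded_sizes(ch))
```

The same abstract character is stored as a different byte sequence, of a different length, depending on the encoding chosen.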

Different character sets ASCII, ISO-8859 series, Windows series, User defined, ISO, UTF-8, UTF-16, UTF-32

Unicode Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language. Developed by the Unicode Consortium. There are many versions; the current one accommodates more than 65,000 characters. Synchronized with the corresponding versions of ISO

Unicode Standards incorporated under Unicode: ISO 6937, ISO 8859 series, ISCII, KS C 5601, JIS X 0208, JIS X 0212, GB 2312, CNS, etc. Scripts and characters: European alphabetic scripts; Middle Eastern right-to-left scripts; scripts of Asia; Indian languages: Devanagari, Bengali, Gurmukhi, Oriya, Tamil, Telugu, Kannada, Malayalam. Punctuation marks, diacritics, mathematical symbols, technical symbols, arrows, dingbats, etc.

Assigning Character Codes A unique number is assigned to each code element and is called a code point. Code points are written as hexadecimal numbers with the prefix "U+"; e.g., U+0041 is the code point of "A". Unicode groups characters together by script in code blocks. Code blocks vary in size, depending on the size of the script. Code elements are grouped logically throughout the range of code points, called the codespace.
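
As a small sketch of code points and code blocks, the following Python snippet prints the U+ notation for a character and looks up its block. The `BLOCKS` table is a hand-written excerpt covering only three blocks, and the helper names `code_point` and `block_of` are hypothetical.

```python
import unicodedata

# Hand-picked excerpt of Unicode code blocks (start, end), for illustration only.
BLOCKS = {
    "Basic Latin": (0x0000, 0x007F),
    "Devanagari": (0x0900, 0x097F),
    "Kannada": (0x0C80, 0x0CFF),
}

def code_point(ch):
    """Format the code point of ch in the conventional U+XXXX notation."""
    return f"U+{ord(ch):04X}"

def block_of(ch):
    """Return the name of the block containing ch, or None if it is not in the excerpt."""
    cp = ord(ch)
    for name, (start, end) in BLOCKS.items():
        if start <= cp <= end:
            return name
    return None

for ch in ("A", "अ", "ಕ"):
    print(ch, code_point(ch), block_of(ch), unicodedata.name(ch))
```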

Text handling Computer text handling involves processing and encoding. The Unicode Standard directly addresses only encoding; processing is carried out by software. It does not define glyph images (the visual shapes of characters); display software retrieves the glyphs. The Unicode Standard does not specify the size, shape, or orientation of on-screen characters.

Objectives To assess the suitability of GSDL for developing digital library collections in Indian languages (Hindi and Kannada). To create a search and browse interface for the GSDL software in Hindi and Kannada.

Need Immeasurable amount of literature in many languages E-publishing in Indian languages E-governance in India E-learning Digital libraries for Rural population

Greenstone Digital Library Software Open source. Developed by the CS Department, University of Waikato, New Zealand. Can handle different file formats. Works on different platforms. Supports many languages through Unicode.

Multilingual support Interface part Content part

Methodology Software: Windows XP operating system, GSDL, Macromedia Fireworks, Nudi, Baraha, Internet Explorer 6.0. Hardware: Pentium III with 128 MB RAM.

Hindi and Kannada Interface Separate .dm files were created for both languages. _textimagehome_ {Home Page} _textimagehome_ [l=kn]{कि सु&#2330 } Creating tabs for Hindi & Kannada. Hindi tabs: Macromedia Fireworks, Baraha transliteration software. Kannada tabs: Macromedia Fireworks, Nudi transliteration software.
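
The macro fragment above contains a numeric character reference (`&#2330`), which suggests that some interface strings were written as decimal HTML references. The Python sketch below shows how such a macro line could be generated. The helpers `to_ncr` and `macro_line` are hypothetical, and the idea that the project's .dm files used this convention throughout is an assumption based only on the slide fragment.

```python
def to_ncr(text):
    """Replace every non-ASCII character with a decimal numeric character reference."""
    return "".join(ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in text)

def macro_line(name, lang, text):
    """Build one language-specific macro line in the `_name_ [l=xx] {value}` shape shown on the slide."""
    return f"{name} [l={lang}] {{{to_ncr(text)}}}"

# Illustrative values only: the language code and the text are not taken from the project files.
print(macro_line("_textimagehome_", "hi", "च"))
```

Writing the strings as numeric references keeps the macro file pure ASCII, so it can be edited safely whatever encoding the editor assumes.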

Collection building हिन्दी काव्यालय : is downloaded from ಉದಯವಾಣಿ ಸಂಗ್ರಹಣೆ : is downloaded from हिन्दी Unicode collection ಕನ್ನಡ ಯೂನಿಕೋಡ್ ಸಂಗ್ರಹಣೆ

System description हिन्दी काव्यालय / ಉದಯವಾಣಿ ಸಂಗ್ರಹಣೆ: Susha/Shree-Kan-0850 → Font folder; Lang interface → Hindi/Kannada; Preference encoding → Latin Based; Browser encoding → Latin Based or User defined. Hindi/Kannada Unicode collection: Mangal/Tunga for Hindi/Kannada → Font folder; Lang interface → Hindi/Kannada; Preference encoding → utf-8; Browser encoding → utf-8.
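
The settings above matter because the browser encoding has to match the bytes actually stored in the collection. The Python sketch below shows, assuming a UTF-8 collection, what happens when the same bytes are read with a Latin-based setting. The variable names are illustrative, and the snippet does not model Greenstone itself.

```python
text = "हिन्दी"                      # six code points of Devanagari text
raw = text.encode("utf-8")           # bytes as stored in a Unicode (utf-8) collection

right = raw.decode("utf-8")          # browser encoding set to utf-8: text is recovered
wrong = raw.decode("latin-1")        # browser encoding set to a Latin charset: one symbol per byte

print(len(text), len(raw))           # code points versus stored bytes
print(right == text, wrong == text)  # only the matching encoding round-trips
```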

Observations Can have interfaces in many languages. Can build collections in many languages with encodings other than Unicode. Non-Unicode collections have only the browse feature. Titles of the non-Unicode collections were in English. Unicode collections have both search and browse features. All collections can be accessed over a network.

Observations Uses the MG compression technique. Can browse lists of authors, lists of titles, lists of dates, and so on. Can handle very large collections. New data can be added to an existing collection at any time. Open-source software; anybody can develop it, and it can be adapted to local requirements.

Limitations Fails to display Unicode HTML files in Hindi/Kannada. Does not support truncated searching for Indian scripts. The case differences option cannot be disabled on the preferences page. At present the search feature works only on Windows XP.

Conclusion Multilingual digital libraries will be ubiquitous in the future and will provide the basis for a very broad set of distributed living activities, including computer-supported co-operative work, distance learning, etc. Developing countries like India, where many languages are in use, could utilize comprehensive software such as Greenstone, since Greenstone, being open-source software, is readily extensible to meet the needs of multilingualism.

Future developments It can be extended to other Indian languages that Unicode supports. The display problem with HTML files can be solved for Indian languages by creating model mappings in the utf-8 charset. Collections can be tested with different file formats like PDF, RTF, etc. for other Indian languages. It can be tested with other operating systems like UNIX and Linux and browsers like Netscape and Opera to assess their compatibility. Stemming algorithms can be developed for Indian languages and incorporated into GSDL.

Any Q’s ಪ್ರಶ್ನೆಗಳಿವೆಯೆ ? कोई प्रश्न ?

Thank you ವಂದನೆಗಳು धन्यवाद