Multilingual Computing

Dr. Lu Qin (陸勤), Rm PQ 814, ext 7247
Course material on-line:
Lecture notes available: Friday 14:30 of the previous week
Lab/tutorial hand-outs: Friday 14:30 of the previous week
Schedule and announcements on-line
Office hours: 2:30 - 3:30 Tues, 2:30 - 3:30 Thurs
Labs: Mr. Joe Lam, Rm QT406
Textbook: CJKV Information Processing, by Ken Lunde, O'Reilly, 1999
See web site under Lecture 1

Teaching and Assessment
Lectures (fundamentals):
Introduction
Characteristics of different languages (scripts)
Computer representations
Input processing and output processing
Information processing techniques: open systems, internationalization and localization, algorithms
Software development for multilingual environments
Introduction to natural language processing

Tutorials/labs (gain experience in using a common Chinese operating system and programming): MS Chinese Windows, different programming environments

Assessment: 60% final, 15% midterm, 20% projects and homework (15% + 5%), 3% class participation and 2% punctuality

The course is assessed in two parts: continuous assessment and the final exam. Please understand that the ratio of continuous assessment to final exam is fixed at 40/60 for the program, so 60% of the grade comes from the final exam. I have divided the continuous assessment into 5 components: the midterm exam (15%), the programming assignments (15%), homework (5%), class participation (3%) and punctuality (2%). To encourage students to ask questions, I will give the 3% to any student who asks at least 3 questions during my lectures and tutorials. Students who come to class on time over 90% of the time will also get the 2%.

During tutorials, I will review lecture subjects and give out in-class exercises, some taken from old exam questions. Exercises will not be marked. Sometimes I may give out extra exercises. I do encourage students to discuss questions, as it is one of the best ways to learn. If you are not sure of an answer, I am happy to discuss it with you, but I will not give out "model" answers.

Lab assignments will be given during the lab and should be completed in the lab, but will not be marked. Questions related to lab assignments will appear in the midterm exam and the final exam, so it helps if you have done them yourself.

There will be two programming assignments (projects), most likely team projects of two persons, with teams assigned by the lecturer. The projects will be graded according to 4 major criteria: correctness, design/efficiency, GUI, and documentation.

What is Multilingual Computing
Computer processing of data related to more than one language/script, including any human-computer interaction activity where communication is achieved
Bilingual, trilingual, vs. multilingual
Fundamental issues:
Dealing with different languages, where each language has its own characteristics that require expert knowledge of that language. Example: count the number of words in "Multilingual Computing" vs "多語言文字處理技術"
Ways to distinguish different scripts
How can a system be designed so that it can be used with different languages with minimal changes?
How can a system be designed so that it can be used for multiple languages at once?

The title of this course has two key words: multilingual and computing. Multilingual means that we will be studying issues related to different languages; computing deals with how to handle and process these different languages on a computer system. As this is an introductory course, we are interested only in the written form of the language, which we call scripts (文字) or text.

The term multilingual generally means the co-existence of more than one language. The term bilingual means two languages. For example, you say a person is bilingual if he speaks both Chinese and English well. Trilingual means 3 languages. We normally say that Hong Kong is a trilingual society as we use English, Cantonese and Putonghua. It should be pointed out that the written languages in Hong Kong are bilingual (English and Chinese), but spoken Chinese can be subdivided into Cantonese, a dialect, and Putonghua.

Q1: Is Hong Kong a multilingual society?
Q2: Give some examples of multilingual societies.

When dealing with different languages in a computer, there are some fundamental issues we need to address. Firstly, we need to know the characteristics of each language and how to represent and process each of them in a computer system. Secondly, when more than one language is supported in a computer, we need methods to distinguish them.

Q3: Why do we want to distinguish them? Example: count the number of words in "Multilingual Computing" and in "多語言文字處理技術".

From the system design point of view, we need to know how to design a system so that, starting from one language, it can support an additional language with minimum change, and also how the system can be designed so that different languages can co-exist.
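The word-counting example can be tried directly. The following is a minimal sketch, not part of the original lecture: the function names are hypothetical, and the Han test assumes that checking the basic CJK Unified Ideographs range (U+4E00 to U+9FFF) is sufficient for this sample string. It shows that a space-based word count, which works for English, reports the whole Chinese phrase as a single "word", which is one reason a system needs to distinguish scripts.

```python
def count_space_delimited_words(text: str) -> int:
    """Count words the 'English' way: split on whitespace."""
    return len(text.split())


def count_han_characters(text: str) -> int:
    """Count characters in the basic CJK Unified Ideographs range.

    Assumption: U+4E00..U+9FFF is enough for this example; extension
    blocks and compatibility ideographs are ignored.
    """
    return sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")


english = "Multilingual Computing"
chinese = "多語言文字處理技術"

print(count_space_delimited_words(english))  # 2
print(count_space_delimited_words(chinese))  # 1, although it is not one word
print(count_han_characters(chinese))         # 9
```

Counting Han characters is still not the same as counting Chinese words; finding the word boundaries requires segmentation, which needs knowledge of the language.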

Different Scripts(Written languages)
English: fixed alphabet; words are naturally delimited by SPACE; more morphological changes, but very regular; more of a token-based language than a phonetic-based language; written from left to right. Example: auto, automatic, autonomous, automation, auto-movement. Spelling is easy to do.
Phonetic transcription systems: Pinyin, Jyut Ping (粵拼), International Phonetic Alphabet (IPA)
Korean: Hanja (漢字), similar to Chinese; Hangul is a two-dimensional Pinyin system. In other words, Hangul is a phonetic script or phonetic transcription system.

Script (文字) normally refers to a writing system which has its own writing symbols and which, when written following certain rules, forms the written form of the language. In our context, script refers to the written part of a language, which we call text. Han Zi (漢字) refers to the ideographic writing system and to a collection of actual characters. Cantonese, however, is normally not considered a script but a dialect, because it uses the Han Zi writing system, i.e. the ideographic characters. The word formation rules (構詞法) in Chinese are quite loose because the morphological rules (詞法規則) are quite arbitrary, even though some rules do exist, such as 姐姐 (repetition), 地震 and 口吃 (subject-verb -> noun), 匯集 and 尺寸 (parallel).

English, again, is not just a spoken language. It has both a written form (script) and a spoken form. There are more regular morphological rules in English than in Chinese. In English, the morphs (詞素) themselves, such as auto-, anti-, etc., carry meaning and can be combined with others to give a specific meaning. English, written by spelling, is more of a token-based language because many words are "grown" from a word "stem" (a morph). Take the stem auto as an example: variants such as automatic, automation and autonomous can be generated from it. The word modern, as another example, can grow more words.

Q: Think of other examples of English morphs (stem words) and their variants.

English has a relatively small alphabet, which makes its representation in a computer easy. The pronunciation of English words can be predicted in most cases. However, English is normally not considered a phonetic-based system (or a phonetic transcription system) because the spelling of a word does not tell its exact pronunciation. For example, can is pronounced [kæn] and cane [kein], so the same sequence -can- has different pronunciations; conversely, in tailor and carpenter the same sound [ə] at the end of the two words is spelled differently. In a phonetic transcription system, the symbols dictate the sounds only, and two different transcriptions should correspond to two different pronunciations.

The Korean language has two forms of script: Hanja (Chinese characters) and Hangul. Hangul is a phonetic transcription system, or what we would call a Pinyin (拼音) system. In fact, every language based on Chinese characters has a parallel phonetic transcription system. In China it is called the Hanyu Pinyin (漢語拼音) system; in Taiwan it is called Zhuyin (注音系統). In Japan, the transcription system is called Kana (假名). It is good to know that Cantonese also has a phonetic transcription system, the Jyut Ping system (粵語拼音系統) developed by the Linguistic Society of Hong Kong.

Korean Hangul
Example syllables: KA KEU NGOA SAN NUN KOAEN
Romanization: using Roman letters to denote the phonetic transcriptions

Here is an example of Korean Hangul. Korean Hangul can be considered a two-dimensional phonetic transcription system (transliteration system): KA KEU NGOA SAN NUN KOAEN (see pp. 36 - 37, Jamo table). Using Latin letters to denote a phonetic transcription system is called Romanization. Romanization has the advantage of using the English alphabet, so you can use any English typing device: computer keyboard, typewriter, etc.

Q: Give other examples of Romanized transcription systems.

Japanese Kana
Hiragana (phonetic): can be used entirely without any Han characters, but is often used with Han characters (Kanji); for native Japanese/Chinese words
Katakana (phonetic): denotes only foreign words
Writing runs either left-to-right or top-to-bottom for Hiragana and Katakana as well as Han characters

In Japan, besides Kanji (漢字), which is a writing system borrowed from China, Kana (假名) is the native script (even though its symbols were also influenced by Chinese writing). Kana, again, is a phonetic transcription system, and it has two distinct written forms: Hiragana (平假名) is for native Japanese words, whereas Katakana (片假名) is for foreign words (direct phonetic transliteration of foreign words). Japanese can be written horizontally from left to right, or vertically from top to bottom with the columns running from right to left. Kana also has a Romanized equivalent: Kyoto (京都), Tsukuba (築波). In fact, in the Kana table on the slide, the Romanization symbols are given by the column names and row names.

The Chinese Language General Characteristics
Sino-Tibetan language family (漢藏語系)
Ideographic in nature (表意文字)
50+ languages in the PRC
Hanyu is the official language
7 major Hanyu dialects
Hanyu dialect similarities: a relatively unified writing system; some dialect-specific characters and variant character writing
Hanyu dialect differences: different pronunciation across dialects; different words (e.g. 係 and 是); word-order reversal (e.g. 找尋 and 尋找); different expression/grammar (e.g. 先坐 and 坐先)

The Chinese language is categorized under the Sino-Tibetan language family. A character reveals the meaning it represents, so the characters are called ideographs (表意文字). In the People's Republic of China there are more than 50 different languages (in spoken form), and some have their own unique scripts, such as the Mongolian script, the Tibetan script, etc. Hanyu (漢語), or Mandarin Chinese, is the official language. Its written form is Hanzi (漢字). Under Hanyu there are seven major dialects, of which Mandarin was chosen as the official language (dialect). Many people in Hong Kong like to quote the story that Cantonese was one ballot short of being voted the national spoken language in Sun Yat Sen (孫中山)'s government.

Even though there are so many different spoken forms of the language, the written Hanzi system is very much the same throughout China, apart from some dialect-specific characters. The phonetic information of a Chinese character is not explicit. Sometimes you can guess the pronunciation from the component characters; sometimes the pronunciation has no relation to the components. This makes learning Chinese difficult without a phonetic transcription system. Let's take a closer look at a Chinese character.

Graphemics ( the look, 形 )
Chinese characters: graphemics (the look, 形)
Strokes (distribution 1-30+), radicals (214+), components (500+), characters (65,000+)
Stroke sequence order
Variant writing (e.g. 教都 )
Character formation: bounded radicals and components, but an unbounded alphabet / character set (charset)
6 principles: pictographic 象形 (火), indicative 指事 (一二), meaning 會意 (炎旦), ideo-phonetic 形聲 (訪), borrowed 假借 (孰熟), transitive 轉注 (考老)

Each character is associated with three features: its look, called graphemics; its pronunciation, called phonetics; and its meaning, called semantics. The graphemics defines the character glyph (字形), which can be described at several levels: strokes (a stroke is one connected single pen movement), radicals (部首), which are used mainly for indexing and classification, components (部件), and characters. The strokes of a character are supposed to be written in a fixed order. For historical reasons, however, the same character can have variant forms, called variants (異體字). By the traditional definition, variants are two characters which can replace each other in any running text (正文, 文本) without changing the semantics of the text. Chinese characters were formed historically from some basic character components, and there are 6 major classes of character formation rules.

Character Decomposition
The most basic elements of characters are strokes (筆畫). The basic ones are 一 (horizontal, 橫), 丨 (vertical, 豎), 丿 (left-falling, 撇), 丶 (dot, 點) and 乛 (turning, 折).
Chinese components (部件) are composed of strokes. A component can be considered a functional unit, and components can reflect the meaning, pronunciation and origin of a character.
See Chinese character variants (異體字): 隹 and 鳥 both stand for birds, thus 雞 and 鷄.

It is helpful to think of a Chinese character through its components. Chinese radicals are components, and most radicals can be used for indexing purposes: characters having the same radical are grouped together and ordered by their stroke count. Characters with the same pronunciation and meaning can have different glyph shapes, identified by the use of different components. We call these characters variants. Variants may be formed through the use of different ideographic components that have similar meanings. For example, since both ideographic components "隹" and "鳥" symbolize "bird", the character for "chicken" takes both "雞" and "鷄" as variant forms. You can go to the search system on the website for more examples of Chinese character decomposition.

In the traditional sense, the difference between a traditional character (繁體字) and a simplified character (簡體字), such as 們 (们) in 我們 (我们), could also be considered a variant relationship. However, since simplified characters are produced through a systematic method, which can also be used to produce derived simplified characters (類推簡化字), simplified characters are officially not called variants of traditional characters.

Phonetics ( the sound,音)
Phoneme (音素, 單音): contrastive unit of speech (e.g. bag and tag)
Vowels (元音) and consonants (輔音)
Putonghua: single consonants, vowels can be double: b, p, m, f, a, o, e, ai (two phonemes)
Cantonese: kwok, cheung, ng
One character, one syllable: mono-syllabic
Tonal language: tone differentiates meaning. Putonghua: 5 tones (four tones plus the neutral tone); Cantonese: 9 tones(?)

Semantics (the meaning, 義)
Meaning may derive from the components of a character (e.g. 廳)
Single-character words have multiple meanings (樂)
Multi-character words usually have less ambiguity (快樂, 音樂)
Writing from left to right and also from top to bottom
Pinyin system, Zhuyin system (only for learning characters, not as a general reading tool)

The second feature of a character is its pronunciation. Sounds are categorized into vowels (元音), which are pronounced with no obstruction of the air flow, and consonants (輔音), which have some form of obstruction of the air flow. Most Chinese characters are mono-syllabic (單音節) when pronounced. A mono-syllabic character is pronounced in the pattern (C)V(C), where a C in brackets can be a nil sound. Some characters, for example some units of measurement, are multi-syllabic (多音節), such as "瓩", "嗧", etc. It should be noted that a Chinese character can be associated with more than one pronunciation, even in the same dialect. Sometimes different pronunciations signify different meanings, but that is not always true.

Q: Give some examples of multi-syllable characters.

Chinese is also called a tonal language because different tones represent different characters. Examples: Pinyin: zhang1 張, zhang3 漲; Jyut Ping (粵拼): ciu1 超, ciu4 朝.

The third feature of a character is its meaning. Each character can carry some meaning. If the meaning is independent of other characters, the character is also called a word, such as 人 and 我. Some characters still carry their own meaning when used with others, such as in 人民. Some do not, such as in 躊躇. Some characters can be used alone (called free morphs), but some cannot be used alone (called bound morphs), such as 阿, 者, etc. In the case of 人民, 人 is a free morph whereas 民 is a bound morph.

Character Set
A character set is a collection of characters. The set usually has a name, such as the KangXi character set. Usually, each character in a character set is unique: $C = \{c_i \mid 1 \le i \le n,\ c_i \text{ is a character}\}$
Computer processing of a character set requires that each character in the set is assigned a unique binary value.
Encoding: the process of mapping a character to a numeric value.
A coded character set, normally referred to as a codeset $CC$, can be considered a set of tuples: $CC = \{(c_i, code_i) \mid c_i \in C \text{ and } code_i \in CODE\}$, where $code_i \neq code_j$ if $c_i \neq c_j$. $CODE$ is normally a set of integers in binary form and is also called the code space.

Each language is defined by a set of symbols which are not considered divisible when being used. For English, the alphabet is such a set, which we call a character set. The English alphabet provides the basic symbols used in the language, but the smallest language unit relies on the alphabet being spelled into a vocabulary (詞彙). In Chinese, it is not practical to take strokes as the smallest representation, because strokes are not units of meaning. The smallest unit that can be used to represent the language is a character, so we normally consider characters the smallest units of the language.

It should be pointed out that the English alphabet is a closed set with a fixed number of elements, whereas Chinese characters form an open set whose number of elements is not fixed in general, as new characters can be created at different times. As computers work well with finite sets, it is practical to introduce the concept of a character set, which has a mathematical meaning and can easily be represented in a computer system. Generally speaking, a character set is a named set with a finite collection of characters (or symbols).

Membership in a character set depends on the nature of the named set: members are usually selected for some reason or purpose, or simply by arbitrary selection. For example, all the characters in the Kang Xi dictionary (康熙字典) form a named set (the Kang Xi characters) with a finite number of characters, and thus it is a character set. In mathematical terms, there is no concept of order in a set; the written order of a set is only for convenience and does not indicate the importance or any other significance of the members. However, each member of a set must be unique. It is important to know that for a named set, we can always use the function in( ) to check whether or not an element belongs to the set.
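The definitions above map naturally onto a set and a dictionary. The sketch below is illustrative only: the four-character sample set, the function names and the choice of bit strings as code values are assumptions made for this example, not part of the lecture. It shows the in( ) membership test on a character set, and a coded character set built by pairing each character with a distinct codepoint.

```python
# A tiny, arbitrary character set used only for illustration.
sample_charset = {"A", "B", "C", "D"}


def is_member(ch: str, charset: set) -> bool:
    """The in( ) test: does the element belong to the named set?"""
    return ch in charset


def make_codeset(charset: set, code_space: list) -> dict:
    """Pair each character with one codepoint from the code space.

    A set has no order, so the characters are sorted first only to make
    the assignment reproducible. Codepoints are kept as bit strings so
    that their length is preserved.
    """
    chars = sorted(charset)
    if len(code_space) < len(chars):
        raise ValueError("code space is too small for this character set")
    return dict(zip(chars, code_space))


def has_unique_codes(codeset: dict) -> bool:
    """Check that code_i != code_j whenever c_i != c_j."""
    return len(set(codeset.values())) == len(codeset)


cc = make_codeset(sample_charset, ["00", "01", "10", "11"])
print(is_member("A", sample_charset))  # True
print(is_member("E", sample_charset))  # False
print(cc["C"])                         # 10
print(has_unique_codes(cc))            # True
```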

Note that CODE is a set of numbers, usually in consecutive order.
Examples: suppose $CODE_1 = \{00, 01, 10, 11\}$, $CODE_2 = \{0000, 0001, 0010, 0011\}$, $CODE_3 = \{1000, 1001, 1010, 1011\}$, and
$CC_1 = \{(c_i, code_i) \mid c_i \in C \text{ and } code_i \in CODE_1\}$
$CC_2 = \{(c_i, code_i) \mid c_i \in C \text{ and } code_i \in CODE_2\}$
$CC_3 = \{(c_i, code_i) \mid c_i \in C \text{ and } code_i \in CODE_3\}$
Then $CC_1$, $CC_2$ and $CC_3$ are different codesets!
A codeset can also be considered conceptually as a character set with a predetermined order, where the order is determined by the numerical values in CODE.
The length of the binary value depends on the size of $C$ or on some predetermined number.
Codepoint: a value in the code space.
For Chinese, since there are more than 256 characters in the set, at least 2 bytes (at most 64k codepoints) are necessary to represent all the Chinese characters.

A character set can be either non-coded or coded. A coded character set, sometimes referred to as a codeset, is a character set where each element is given a unique numerical value. The complete set of values in CODE is called the code space. Computer systems can support coded character sets only: each character, as a member of the set, is represented by a binary number, and this number must be unique for each symbol. A value in a code space is called a codepoint. The codepoint values are defined in the set CODE.

It should be noted that CODE conforms to the mathematical definition of a set. Therefore, the elements "0000" and "00" are considered different elements. In the examples, $CODE_1$, $CODE_2$ and $CODE_3$ are considered different sets because their elements are different. This is important because representations of the same numerical value with different lengths are considered different in computers and are handled differently. The mapping of a character in a codeset to a numerical value is called codepoint assignment. An encoding method explains how a character is mapped to a codepoint.

Q: For a character set with n characters, what is the minimum number of bits required so that all characters can be given different values?
Q: In a computer, the smallest unit of representation is a byte (8 bits). What is the minimum number of bytes required for n characters?
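One way to explore the two questions above is the sketch below. It assumes a plain fixed-length code in which every one of the n characters gets its own bit pattern, so the minimum is the smallest b with 2^b >= n; the function names are hypothetical.

```python
def min_bits(n: int) -> int:
    """Smallest number of bits b such that 2**b >= n (at least 1 bit)."""
    if n < 1:
        raise ValueError("a character set needs at least one character")
    # (n - 1).bit_length() is the number of bits needed to write n - 1,
    # the largest codepoint when codepoints run from 0 to n - 1.
    return max(1, (n - 1).bit_length())


def min_bytes(n: int) -> int:
    """Round the bit count up to whole 8-bit bytes."""
    return (min_bits(n) + 7) // 8


for size in (4, 128, 256, 257, 65000):
    print(size, min_bits(size), min_bytes(size))
# 4 needs 2 bits, 128 needs 7 bits, 256 needs 8 bits (1 byte),
# 257 needs 9 bits (2 bytes), 65000 needs 16 bits (2 bytes)
```

This matches the remark that a set with more than 256 characters cannot fit in one byte, and that 2 bytes give at most 64k codepoints.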

Numerical Notations
Decimal notation (10 distinct values): no prefix
Binary notation (2 distinct values)
Hexadecimal notation: 0xHH, where H is one of 0..9, A..F
Hexadecimal notation is normally used in place of binary notation for better readability: 1 to 4 binary digits -> 1 hex digit
Scalar value: the actual numeric value of any fixed-digit number: $\mathrm{scalar}(0001) = 1_2$, $\mathrm{scalar}(0111) = 7_{16}$, $\mathrm{scalar}(01111) = F_{16} = 15_{10} = 1111_2$
In a computer, 00AF and AF represent different things, but they have the same scalar value.

The most commonly used numerical notation is decimal, where there are 10 distinct digits, 0 to 9. For any number 10 or beyond, we have to use the position of each digit relative to the decimal point to denote its weight. For example, $56 = 5 \times 10 + 6$, not $5 + 6$, and $561 = 5 \times 10^2 + 6 \times 10 + 1$. For the same reason, in binary notation $10110_2 = 1 \times 2^4 + 0 \times 2^3 + 1 \times 2^2 + 1 \times 2^1 + 0 \times 2^0$.

For a number with n digits in base N, the value of the number is
$D_{n-1} D_{n-2} \ldots D_1 D_0 = D_{n-1} N^{n-1} + D_{n-2} N^{n-2} + \cdots + D_1 N^1 + D_0 N^0$
and the total number of different values it can take is $N^n$.

In hexadecimal notation, numbers are represented with 16 distinct digits: 0, ..., 9 and A, ..., F. Binary notation has a very straightforward conversion into hexadecimal numbers: each group of up to 4 binary digits becomes 1 hex digit.

Q: What is the corresponding hex number of a given binary number?

All computer codes are in some form fixed in length, because of the byte representation used in a computer. Thus, a code space is normally represented by a so-called fixed-length encoding. Please note that in computer systems, $00AF_{16}$ represents a two-byte code whereas $AF_{16}$ represents a one-byte code. They are considered different codepoint assignments! Mathematically, however, even though their numbers of digits differ, they have the same numerical value. Therefore, we introduce the so-called scalar value to denote the actual numeric value of a fixed-length number notation.
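The positional formula and the scalar-value idea can be checked with a short sketch. The helper names below are hypothetical; Python's built-in int(text, base), format() and bytes.fromhex() do the conversions.

```python
def positional_value(digits: list, base: int) -> int:
    """Evaluate D[n-1]*N^(n-1) + ... + D[1]*N + D[0] for digits given
    most significant first."""
    value = 0
    for d in digits:
        value = value * base + d
    return value


def scalar(code: str, base: int = 16) -> int:
    """Numeric value of a fixed-length code, ignoring its length."""
    return int(code, base)


print(positional_value([5, 6, 1], 10))       # 561
print(positional_value([1, 0, 1, 1, 0], 2))  # 22
print(format(0b10110, "X"))                  # 16, i.e. 10110 in binary is 0x16

# 00AF and AF have the same scalar value but different lengths in bytes.
print(scalar("00AF") == scalar("AF"))        # True
print(len(bytes.fromhex("00AF")), len(bytes.fromhex("AF")))  # 2 1
```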

ASCII code table
0x00 - 0x1F and 0x7F: control characters
0x20 - 0x7E: graphic characters (printable chars)
Code range: the range of values in codepoint assignment
The code range is 0x00 to 0x7F for ASCII
A code range may not start from scalar value zero

Let us take a look at the ASCII (American Standard Code for Information Interchange) codeset. ASCII is an internationally accepted coding standard. As a character set, it has 128 distinct symbols.

Q: What is the minimum number of bits ASCII characters need?

The codepoint assignment is given in the table. The row number corresponds to the high 3 bits and the column number corresponds to the low 4 bits. Notice that when you have a codeset, the characters can be ordered naturally through their assigned codepoints; we call this the internal order.

Q: If we sort data according to internal code, what should be the order for the input CfdMa?

In a codeset, characters are further categorized into sub-classes called subsets. In ASCII, there is a subset called control characters, in the range 0x00 to 0x1F plus 0x7F, which, when recognized by a computer system, should trigger certain actions. For example, <CR> (Carriage Return) should move the cursor back to the beginning of the line, and together with a line feed it starts the next line. Another subset in ASCII, in the range 0x20 to 0x7E, is called printable characters, which define the symbols used in the written script.
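The table layout and the internal order can be explored with the sketch below. The function names are hypothetical; the ranges are the ones stated above (0x00 to 0x1F and 0x7F for control characters, 0x20 to 0x7E for printable characters), and Python compares strings by codepoint, which coincides with ASCII internal order for these characters.

```python
def row_and_column(ch: str) -> tuple:
    """Split a 7-bit ASCII codepoint into (high 3 bits, low 4 bits)."""
    code = ord(ch)
    if code > 0x7F:
        raise ValueError("not an ASCII character")
    return code >> 4, code & 0x0F


def is_control(ch: str) -> bool:
    """Control characters: 0x00..0x1F and 0x7F."""
    code = ord(ch)
    return code <= 0x1F or code == 0x7F


def is_printable_ascii(ch: str) -> bool:
    """Graphic (printable) characters: 0x20..0x7E."""
    return 0x20 <= ord(ch) <= 0x7E


print(row_and_column("A"))       # (4, 1) because 'A' is 0x41
print(is_control("\r"))          # True, <CR> is 0x0D
print(is_printable_ascii("~"))   # True, '~' is 0x7E
print("".join(sorted("CfdMa")))  # CMadf: upper case sorts before lower case
```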

Row-cell notation: a matrix in which a row number and a column number define a cell, and thus the order of the characters; it also avoids binary notation. This is particularly useful when the code range is not consecutive.
Character subsets: characters of a similar nature are put next to each other, with different subsets in different rows.

Some codepoints in the code space may not have any character assigned; they are called empty codepoints. Sometimes the codepoints are not consecutive; this can happen in either the first byte or the second byte of a multi-byte coding standard. To avoid confusion, there is a logical row-cell notation in which codepoints are not mentioned explicitly. Instead, decimal row and cell numbers are used, and they always start from 1. For example, Big5 has 94 rows and 157 cells. The first row actually uses the value 0xA1: the code range for the first byte is 0xA1 to 0xFE, and the second byte is either 0x40-0x7E or 0xA1-0xFE.

Q: The number of rows is $FE_{16} - A0_{16} = 94_{10}$
Q: The number of cells is $(FE_{16} - A0_{16}) + (7E_{16} - 3F_{16}) = ?$

A codeset is usually composed of subsets of related characters. We call them character subsets (they usually have different names depending on the nature of the characters) or character classifications. Examples: control characters (e.g. CTRL-C), alphanumeric characters (e.g. A..Z, 0..9), symbols (e.g. +, -), arbitrary: +, -, /, *, (, ).

Having character subsets has two major advantages:
It is easy for people to remember a group of characters and to refer to it.
The ordering within a subset can be assigned according to features related to the data in that subset only. This helps with sorting and searching.
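The row-cell arithmetic for the Big5 example can be sketched as follows. This is only an illustration using the byte ranges quoted above (first byte 0xA1 to 0xFE, second byte 0x40 to 0x7E or 0xA1 to 0xFE); the function and constant names are hypothetical, and real Big5 implementations and extensions may use other ranges.

```python
FIRST_LO, FIRST_HI = 0xA1, 0xFE        # first-byte range from the text
SECOND_A_LO, SECOND_A_HI = 0x40, 0x7E  # lower second-byte range
SECOND_B_LO, SECOND_B_HI = 0xA1, 0xFE  # upper second-byte range


def big5_to_row_cell(first: int, second: int) -> tuple:
    """Map a two-byte code to 1-based (row, cell) numbers."""
    if not FIRST_LO <= first <= FIRST_HI:
        raise ValueError("first byte out of range")
    row = first - FIRST_LO + 1
    if SECOND_A_LO <= second <= SECOND_A_HI:
        cell = second - SECOND_A_LO + 1
    elif SECOND_B_LO <= second <= SECOND_B_HI:
        # Cells of the upper range continue after the lower range,
        # which closes the gap 0x7F..0xA0 in the second byte.
        cell = (SECOND_A_HI - SECOND_A_LO + 1) + (second - SECOND_B_LO + 1)
    else:
        raise ValueError("second byte out of range")
    return row, cell


print(big5_to_row_cell(0xA1, 0x40))  # (1, 1), the first cell of the first row
print(big5_to_row_cell(0xA1, 0xA1))  # (1, 64), first cell after the gap
print(big5_to_row_cell(0xFE, 0xFE))  # (94, 157), the last row and last cell
```

The last call reproduces the 94 rows and 157 cells quoted for Big5.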

Codeset Compatibility
For two character sets $C_1$ and $C_2$: equivalence: $C_1 = C_2$; subset: $C_1 \subseteq C_2$; superset: $C_1 \supseteq C_2$; intersect: $C_1 \cap C_2$, with $C_1 \cap C_2 \neq \emptyset$
Examples: GB & B5 -> ? GB & GBK -> ?
For two coded character sets:
$CC_1 = \{(c_{1i}, code_{1i}) \mid c_{1i} \in C_1 \text{ and } code_{1i} \in CODE_1\}$
$CC_2 = \{(c_{2i}, code_{2i}) \mid c_{2i} \in C_2 \text{ and } code_{2i} \in CODE_2\}$
If for every $(c_{1i}, code_{1i}) \in CC_1$ it is true that $(c_{1i}, code_{1i}) \in CC_2$, then $CC_2$ is said to be fully compatible with $CC_1$.

When you are given two character sets, you can always test for the relationships defined on two sets, namely equivalence, subset, superset and intersection. Two coded character sets can also be compared by equivalence, subset, superset and intersection. However, the relationships between two codesets are more complicated, because each element of a codeset is a two-tuple of a character and a codepoint. One needs to look not only at the character collection but also at the codepoint assignments. So an additional concept, compatibility, is introduced. If Codeset2 is fully compatible with Codeset1, the character set of Codeset2 must be a superset of the character set of Codeset1, and for every character in Codeset1, its codepoint assignment must be exactly the same as in Codeset2.

Example: if $C_1 = \{A, B, C, D\}$ and $C_2 = \{A, B, C, D, E, F\}$, then $C_1 \subseteq C_2$. However, if $CC_1(C_1) = \{(A, 00), (B, 01), (C, 10), (D, 11)\}$ and $CC_2(C_2) = \{(A, 0000), (B, 0001), (C, 0010), (D, 0011), (E, 0100), (F, 0101)\}$, we can only say that the character set of $CC_1$ is a subset of the character set of $CC_2$. $CC_2$ is not considered fully compatible with $CC_1$.

If for every $(c_{1i}, code_{1i}) \in CC_1$ there is $(c_{1i}, code_{2i}) \in CC_2$ with $\mathrm{scalar}(code_{2i}) = \mathrm{scalar}(code_{1i})$, we sometimes also say that $CC_2$ is compatible with $CC_1$, even though $CC_2$ is not fully compatible with $CC_1$.
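The two notions of compatibility can be tested mechanically. In the sketch below, a codeset is modelled as a dictionary from character to a bit-string codepoint; this representation and the function names are assumptions made for illustration, and the data is the A to F example above.

```python
def fully_compatible(cc2: dict, cc1: dict) -> bool:
    """True if every (character, code) pair of cc1 also appears in cc2."""
    return all(ch in cc2 and cc2[ch] == code for ch, code in cc1.items())


def scalar_compatible(cc2: dict, cc1: dict) -> bool:
    """Weaker test: codes may differ in length but not in scalar value."""
    return all(
        ch in cc2 and int(cc2[ch], 2) == int(code, 2)
        for ch, code in cc1.items()
    )


cc1 = {"A": "00", "B": "01", "C": "10", "D": "11"}
cc2 = {"A": "0000", "B": "0001", "C": "0010", "D": "0011",
       "E": "0100", "F": "0101"}

print(set(cc1) <= set(cc2))         # True: the character set is a subset
print(fully_compatible(cc2, cc1))   # False: "00" and "0000" are different codes
print(scalar_compatible(cc2, cc1))  # True: the scalar values agree
```

Keeping codepoints as strings rather than integers is deliberate here: it preserves the code length, which is exactly what separates full compatibility from scalar compatibility.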
