Ruby M17N, RubyKaigi 08, Martin J. Dürst.


1 Ruby M17N; RubyKaigi 08; Martin J. Dürst

2 Summary: (intro); Ruby (Ruby's Case); (transcoding) by Martin; (questions)

3 Who is naruse: nkf; Softbank Technology (iPhone)

4 Ruby M17N: CSI

5 M17N methods: UCS Normalization; CSI (Code Set Independent)

6 UCS Normalization: UCS (Universal Character Set)

7 Perl's case (Unicode)
  Decode: $str = decode("UTF-8", "\xE3\x81\x82"); $str -> "あ"
  Encode: $bytes = encode("UTF-8", "あ"); $bytes -> "\xE3\x81\x82"

8 CSI: Code Set Independent; Solaris, Citrus; __STDC_ISO_10646__ C

9 Ruby: String; " ".encoding ->
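
A minimal sketch of the idea on this slide, assuming a current Ruby (1.9 or later): every String carries an Encoding next to its bytes, and `force_encoding` changes only that label. The variable names are illustrative and not from the talk.

```ruby
# Every String carries an Encoding object alongside its bytes.
kana = "\u3042"                 # HIRAGANA LETTER A, written as an escape (always UTF-8)
p kana.encoding                 # the encoding attached to this string
p kana.bytes                    # the underlying bytes

# The same three bytes, but tagged as binary (ASCII-8BIT).
raw = "\xE3\x81\x82".b
p raw.encoding

# force_encoding relabels the bytes; it does not convert them.
relabeled = raw.dup.force_encoding("UTF-8")
p relabeled == kana
```

Converting bytes from one encoding to another is a separate operation (`String#encode`).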

10 3 Encoding Grades: ASCII Compatible; ASCII Incompatible; Dummy

11 ASCII Compatible: full support; script encoding; faster; UTF-8, Shift_JIS, EUC-JP, ...

12 Major Encodings: US-ASCII; ASCII-8BIT; UTF-8

13 Japanese Encodings: Shift_JIS; EUC-JP

14 Other Encodings: Big5, EUC-KR, EUC-TW, GBK, ISO-8859-X, KOI8-R, KOI8-U, etc.

15 Machine-dependent Encodings: Windows-31J; CP51932; eucJP-ms; Windows-125X

16 ASCII-8BIT: ASCII Compatible 8BIT String; BINARY?

17 ASCII Only: 7BIT String is special; "abcde".ascii_only? -> true; "abcde" + " "
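
The "7BIT String is special" point can be sketched as follows, assuming a current Ruby. A string that contains only ASCII characters can be combined with a string in any ASCII-compatible encoding, while two non-ASCII strings in different encodings cannot. The variable names are illustrative.

```ruby
ascii = "abcde".encode("US-ASCII")
kana  = "\u3042"                    # UTF-8
sjis  = kana.encode("Shift_JIS")    # the same character in Shift_JIS

p ascii.ascii_only?                 # true: only 7-bit characters
p kana.ascii_only?                  # false

# A 7-bit string adopts the encoding of the other operand.
utf8_result = ascii + kana
sjis_result = ascii + sjis
p utf8_result.encoding
p sjis_result.encoding

# Two non-ASCII strings with different encodings cannot be concatenated.
mixing_error = begin
  kana + sjis
  nil
rescue Encoding::CompatibilityError => e
  e
end
p mixing_error.class
```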

18 ASCII Incompatible: limited support; Can't use as script encoding; UTF-{16,32}{BE,LE}

19 UTF-16 & UTF-32: UTF-16BE, UTF-16LE; UTF-32BE, UTF-32LE; UTF-16, UTF-32

20 Dummy encoding: Ruby; for stateful encodings; Encoding#dummy? -> true; ISO-2022-JP, UTF-7
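
The three grades can be told apart at runtime with `Encoding#dummy?` and `Encoding#ascii_compatible?`. The sketch below assumes a current Ruby; the helper `grade` is a hypothetical name introduced only for this illustration.

```ruby
# Classify an Encoding into one of the three grades from the slides.
# `grade` is an illustrative helper, not part of Ruby.
def grade(enc)
  return :dummy if enc.dummy?
  enc.ascii_compatible? ? :ascii_compatible : :ascii_incompatible
end

[
  Encoding::UTF_8, Encoding::Shift_JIS, Encoding::EUC_JP,
  Encoding::UTF_16BE, Encoding::UTF_32LE,
  Encoding::ISO_2022_JP, Encoding::UTF_7
].each do |enc|
  puts "#{enc.name}: #{grade(enc)}"
end
```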

21 Encoding.list: Encoding.list -> [encoding, ..]; Encoding.name_list -> [enc_name, ..]; Encoding.aliases -> {alias => enc_name, ..}
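
A short sketch of the three introspection methods, assuming a current Ruby. The exact set of encodings and aliases depends on the Ruby build, so the example prints only a few stable facts.

```ruby
encodings = Encoding.list        # array of Encoding objects
names     = Encoding.name_list   # array of names, including aliases
aliases   = Encoding.aliases     # hash of alias name => canonical name

puts encodings.size
puts names.include?("UTF-8")
puts aliases["ASCII"]            # canonical name behind the "ASCII" alias

# Encoding.find accepts both canonical names and aliases.
p Encoding.find("ASCII")
```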

22 $KCODE is obsolete: $KCODE; Ruby 1.9

23 String: 1.8: Byte String (Ruby ignores encoding); 1.9: Byte String with encoding (Ruby knows the encoding of string)

24 No Character Object, but 1 Character String: ?.class -> String. Why?

25 A character has...: codepoint; encoding; byte string. 1 char string has them!

26 1.8:
  ?a -> 97 (Fixnum)
  ?\x61 -> 97 (Fixnum)
 1.9:
  ?a -> "a" (US-ASCII)
  ?\x61 -> "a" (US-ASCII)
  ?あ -> "あ" (UTF-8)
  ?\u{3042} -> "あ" (UTF-8)

27 String#ord and Integer#chr: " ".ord # Unicode; chr -> RangeError: out of char range; chr("UTF-8") -> " "
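
The round trip between a one-character string and its codepoint might look like this in a current Ruby. The sketch assumes `Encoding.default_internal` is not set, which is the default; in that case `Integer#chr` without an argument rejects values above 255.

```ruby
# Character literals produce one-character Strings, not Integers.
char = ?a
p char.class                         # String
p char.ord                           # 97

kana = "\u3042"
codepoint = kana.ord                 # Unicode codepoint of the character
p codepoint

# Integer#chr needs an encoding for codepoints outside the single-byte range.
p codepoint.chr(Encoding::UTF_8) == kana
p codepoint.chr("UTF-8") == kana     # an encoding name works as well

range_error = begin
  codepoint.chr                      # no encoding given
  nil
rescue RangeError => e
  e
end
p range_error
```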

28 ?a.encoding; "a".encoding; "\xFF".encoding; "\u{3042}".encoding; "\u{ }".encoding

29 String#[]
  1.8: String#[] -> integer (1 byte); [0] -> 0xE3 # UTF-8
  1.9: String#[] -> 1-character string; [0] -> " "

30 String#length
  1.8: String#length -> byte length; .length -> 9 (UTF-8)
  1.9: String#length -> character length; .length -> 3
       String#bytesize -> byte length; .bytesize -> 9 (UTF-8)
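
The 1.9 behaviour of `String#[]`, `String#length` and `String#bytesize` can be checked with a three-character UTF-8 string. The sample string here is an assumption, since the original text on the slides was lost in extraction.

```ruby
text = "\u3042\u3044\u3046"      # three hiragana characters, 3 bytes each in UTF-8

p text.length                    # 3 (characters)
p text.bytesize                  # 9 (bytes)

# Indexing is character based and returns a one-character String.
p text[0] == "\u3042"

# Byte-level access is still available through separate methods.
p text.getbyte(0)                # 227 (0xE3), what 1.8 returned for [0]
p text.byteslice(0, 3) == "\u3042"
```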

31 String is not Enumerable: String#each is removed; "hoge".each{|l|p l} -> NoMethodError

32 String#each_*: String#each_byte (bytes); String#each_char (chars); String#each_line (lines)
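
A small sketch of the three explicit iterators that replace `String#each`, assuming a current Ruby. The sample text is illustrative.

```ruby
text = "a\u3042\nb\n"

bytes = text.each_byte.to_a      # integers, one per byte
chars = text.each_char.to_a      # one-character strings
lines = text.each_line.to_a      # strings, each including its newline

p bytes
p chars
p lines

# String no longer mixes in Enumerable and has no #each.
p String.include?(Enumerable)    # false
p "hoge".respond_to?(:each)      # false
```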

33 ==: (ArgumentError); 7bit

34 /(.)/ =~ " "; $1 -> " "

35 /\xE3\x81\x82/n =~ "あ" -> ArgumentError: incompatible encoding regexp match (ASCII-8BIT regexp with UTF-8 string); ASCII-8BIT

36 bytes = "Aあ".force_encoding("ASCII-8BIT"); /\xE3\x81\x82/ =~ bytes -> 1; /A/ =~ bytes -> 0
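
The two regexp slides can be reproduced roughly as follows. This sketch assumes a current Ruby, where the mismatch raises `Encoding::CompatibilityError` rather than the `ArgumentError` shown on the slide, and it uses the `/n` flag for both matches so that the pattern is a binary regexp regardless of the script encoding.

```ruby
binary_regexp = /\xE3\x81\x82/n          # ASCII-8BIT regexp containing non-ASCII bytes
utf8_text     = "A\u3042"                # UTF-8 string with a non-ASCII character

# A binary regexp cannot be matched against a non-ASCII UTF-8 string.
error = begin
  binary_regexp =~ utf8_text
  nil
rescue Encoding::CompatibilityError => e
  e
end
p error.class

# Relabel the string as binary, then match on raw bytes.
bytes = utf8_text.dup.force_encoding("ASCII-8BIT")
p(binary_regexp =~ bytes)                # byte offset of the match
p(/A/ =~ bytes)                          # ASCII-only regexps match binary strings too
```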

37 Script Encoding

38 Magic Comment
  #!/bin/env ruby
  # -*- coding: UTF-8 -*-
  /coding[:=]\s*(?[\w.-]+)[^\w.-]/
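
A rough sketch of how a magic comment could be recognised with a pattern similar to the one on the slide. The capture group name `name`, the constant `MAGIC_COMMENT` and the helper `magic_comment_encoding` are assumptions made for this illustration; the group name on the slide was lost, and this is not Ruby's actual parser code.

```ruby
# Simplified pattern: "coding", then ":" or "=", then the encoding name.
# The group name `name` is hypothetical.
MAGIC_COMMENT = /coding[:=]\s*(?<name>[\w.-]+)/

# Returns the Encoding named in a comment line, or nil if none is found
# or the name is unknown to this Ruby.
def magic_comment_encoding(line)
  match = MAGIC_COMMENT.match(line)
  return nil unless match
  Encoding.find(match[:name])
rescue ArgumentError
  nil
end

p magic_comment_encoding("# -*- coding: UTF-8 -*-")
p magic_comment_encoding("# vim: set fileencoding=euc-jp :")
p magic_comment_encoding("puts 'no magic comment here'")
```

Because the pattern looks only for the substring `coding`, both the Emacs style (`coding:`) and the Vim style (`fileencoding=`) are accepted.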

39 -K option: -K; Encoding.external_encoding; script encoding; -E (external encoding)

40 script encoding
  script file: 1. magic comment; 2. -K; 3. US-ASCII (-E)
  -e, stdin: 1. magic comment; 2. -K or -E; 3. locale

41 String#inspect vs String#dump: String#inspect; String#dump: dump; Escape dump

42 IO open
  open(path, "r:utf-8") {|f| puts f.gets }
  open(path, "r:utf-8:euc-jp") {.. }
  open(path, "mode:external:internal")

43 IO with encoding option
  open(path, encoding: "utf-8")
  open(path, encoding: "utf-8:euc-jp")
  open(path, encoding: "external:internal")

44 IO with encoding option
  open(path, external_encoding: "utf-8")
  open(path, external_encoding: "utf-8", internal_encoding: "euc-jp")
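
The three ways of specifying external and internal encodings can be compared on a temporary file. This sketch assumes a current Ruby and uses `File.open` rather than `Kernel#open`; the file name and variable names are illustrative.

```ruby
require "tmpdir"

mode_utf8, mode_euc, option_euc, keyword_euc = Dir.mktmpdir do |dir|
  path = File.join(dir, "sample.txt")
  File.binwrite(path, "\u3042\n")          # UTF-8 bytes on disk

  [
    # External encoding only: data is read as UTF-8.
    File.open(path, "r:utf-8") { |f| f.gets },
    # External and internal: read as UTF-8, transcoded to EUC-JP.
    File.open(path, "r:utf-8:euc-jp") { |f| f.gets },
    # The same, using the :encoding option.
    File.open(path, encoding: "utf-8:euc-jp") { |f| f.gets },
    # The same, using separate keyword options.
    File.open(path, external_encoding: "utf-8", internal_encoding: "euc-jp") { |f| f.gets }
  ]
end

p mode_utf8.encoding
p mode_euc.encoding
p mode_euc.bytes                           # EUC-JP bytes plus newline
p option_euc == mode_euc
p keyword_euc == mode_euc
```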

45 Encoding.default_external: Default encoding for external input; -K or -E > locale
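
As a small illustration, assuming a current Ruby: a file opened for reading without any encoding specification reports the default external encoding. The temporary file is only there to have something to open.

```ruby
require "tmpdir"

p Encoding.default_external      # derived from -K/-E or the locale
p Encoding.default_internal      # nil unless explicitly set

file_encoding = Dir.mktmpdir do |dir|
  path = File.join(dir, "empty.txt")
  File.binwrite(path, "")
  # No encoding given: reads use the default external encoding.
  File.open(path, "r") { |f| f.external_encoding }
end

p file_encoding == Encoding.default_external
```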

46 String as Bytes: String#getbyte(index); String#setbyte(index, value); String#bytesize
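
A minimal sketch of byte-level access, assuming a current Ruby. Changing the last byte of U+3042 (E3 81 82) to 0x84 yields U+3044 (E3 81 84), which shows that `setbyte` edits the bytes in place without regard to character boundaries.

```ruby
text = "\u3042".dup              # unfrozen copy, bytes E3 81 82

p text.bytesize                  # 3
p text.getbyte(0)                # 227 (0xE3)

text.setbyte(2, 0x84)            # overwrite the third byte
p text == "\u3044"               # the string is now U+3044
p text.valid_encoding?
```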

47 transcoding: Martin; RubyKaigiM17N.html

48 Encoding; String#encode; Magic comment
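
Transcoding with `String#encode` can be sketched as follows, assuming a current Ruby. Unlike `force_encoding`, `encode` rewrites the bytes; characters that do not exist in the target encoding raise an error unless a replacement policy is given.

```ruby
kana = "\u3042"                          # UTF-8

euc  = kana.encode("EUC-JP")
sjis = kana.encode("Shift_JIS")
p euc.bytes                              # bytes differ per encoding
p sjis.bytes
p euc.encode("UTF-8") == kana            # round trip back to UTF-8

# US-ASCII has no representation for this character.
undefined = begin
  kana.encode("US-ASCII")
  nil
rescue Encoding::UndefinedConversionError => e
  e
end
p undefined.class

# With undef: :replace the character becomes "?".
replaced = kana.encode("US-ASCII", undef: :replace)
p replaced
```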

49 Dir.open; encoding (Dir.glob, fnmatch); String#encode (transcode); Unicode Win32API?

50 RubyM17N !!!

51 any questions?

52 * UCS * CSI

53 UCS: * UCS * magic comment * UCS

54 CSI: * encoding * magic comment

55 FAQ: Any questions?

56 * US-ASCII

57 [0x3042,0x3044].pack("U"): * pack("U*") encoding * pack("U*") UTF-8 * pack("UC")
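
The difference between `pack("U")` and `pack("U*")` can be sketched like this, assuming a current Ruby. `U` consumes a single codepoint, `U*` consumes all of them, and the result is tagged as UTF-8.

```ruby
codepoints = [0x3042, 0x3044]

packed = codepoints.pack("U*")           # all codepoints
p packed.encoding
p packed == "\u3042\u3044"
p packed.unpack("U*") == codepoints      # and back again

first_only = codepoints.pack("U")        # only the first codepoint is used
p first_only == "\u3042"
```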

58 require: US-ASCII

