=head1 NAME

Encode::Supported -- Encodings supported by Encode

=head1 DESCRIPTION

=head2 Encoding Names

Encoding names are case insensitive. White space in names
is ignored. In addition, an encoding may have aliases.
Each encoding has one "canonical" name. The "canonical"
name is chosen from the names of the encoding by picking
the first in the following sequence (with a few exceptions).

=over 2

=item *

The name used by the Perl community. That includes 'utf8' and 'ascii'.
Unlike aliases, canonical names reach the method directly, so
frequently used names such as 'utf8' need no alias lookup.

=item *

The MIME name as defined in IETF RFCs. This includes all "iso-"s.

=item *

The name in the IANA registry.

=item *

The name used by the organization that defined it.

=back

Where a I<de jure> canonical name differs from the one used by the
Encode module, it is always provided as an alias, as long as the
encoding is implemented at all. So you can safely tell whether a given
encoding is implemented just by passing the canonical name.

Because of all the alias issues, and because in the general case
encodings have state, "Encode" uses an encoding object internally
once an operation is in progress.

=head1 Supported Encodings

As of Perl 5.8.0, at least the following encodings are recognized.
Note that unless otherwise specified, they are all case insensitive
(via alias) and all occurrences of spaces are replaced with '-'.
In other words, "ISO 8859 1" and "iso-8859-1" are identical.

Encodings are categorized and implemented in several different modules
but you don't have to C<use Encode::XX> to make them available in
most cases. Encode.pm will automatically load those modules on demand.

=head2 Built-in Encodings

The following encodings are always available.

  Canonical   Aliases                      Comments & References
  ----------------------------------------------------------------
  ascii       US-ascii ISO-646-US                         [ECMA]
  ascii-ctrl                                    Special Encoding
  iso-8859-1  latin1                                       [ISO]
  null                                          Special Encoding
  utf8        UTF-8                                    [RFC2279]
  ----------------------------------------------------------------

I<null> and I<ascii-ctrl> are special. "null" fails for every character,
so when you set fallback mode to PERLQQ, HTMLCREF or XMLCREF, ALL
CHARACTERS will fall back to character references. Ditto for
"ascii-ctrl" except for control characters. For fallback modes, see
L<Encode>.

=head2 Encode::Unicode -- other Unicode encodings

Unicode coding schemes other than native utf8 are supported by
Encode::Unicode, which will be autoloaded on demand.

  ----------------------------------------------------------------
  UCS-2BE       UCS-2, iso-10646-1                      [IANA, UC]
  UCS-2LE                                                     [UC]
  UTF-16                                                      [UC]
  UTF-16BE                                                    [UC]
  UTF-16LE                                                    [UC]
  UTF-32                                                      [UC]
  UTF-32BE      UCS-4                                         [UC]
  UTF-32LE                                                    [UC]
  UTF-7                                                  [RFC2152]
  ----------------------------------------------------------------

To find how (UCS-2|UTF-(16|32))(LE|BE)? differ from one another,
see L<Encode::Unicode>.

UTF-7 is a special encoding which "re-encodes" UTF-16BE into a 7-bit
encoding. It is implemented separately by Encode::Unicode::UTF7.

=head2 Encode::Byte -- Extended ASCII

Encode::Byte implements most single-byte encodings except for
Symbols and EBCDIC. The following encodings are based on single-byte
encodings implemented as extended ASCII. Most of them map
\x80-\xff (upper half) to non-ASCII characters.

=over 2

=item ISO-8859 and corresponding vendor mappings

Since there are so many, they are presented in table format with
languages and corresponding encoding names by vendors.
Note that the table is sorted by ISO-8859 part number, and that the
corresponding vendor mappings differ slightly from the ISO ones. See
L<http://czyborra.com/charsets/iso8859.html> for details.

  Lang/Regions  ISO/Other Std.  DOS     Windows Macintosh  Others
  ----------------------------------------------------------------
  N. America    (ASCII)         cp437        AdobeStandardEncoding
                                cp863 (DOSCanadaF)
  W. Europe     iso-8859-1      cp850   cp1252  MacRoman  nextstep
                                                         hp-roman8
                                cp860 (DOSPortuguese)
  Cntrl. Europe iso-8859-2      cp852   cp1250  MacCentralEurRoman
                                                MacCroatian
                                                MacRomanian
                                                MacRumanian
  Latin3[1]     iso-8859-3
  Latin4[2]     iso-8859-4
  Cyrillics     iso-8859-5      cp855   cp1251  MacCyrillic
    (See also next section)     cp866           MacUkrainian
  Arabic        iso-8859-6      cp864   cp1256  MacArabic
                                cp1006          MacFarsi
  Greek         iso-8859-7      cp737   cp1253  MacGreek
                                cp869 (DOSGreek2)
  Hebrew        iso-8859-8      cp862   cp1255  MacHebrew
  Turkish       iso-8859-9      cp857   cp1254  MacTurkish
  Nordics       iso-8859-10     cp865
                                cp861           MacIcelandic
                                                MacSami
  Thai          iso-8859-11[3]  cp874           MacThai
  (iso-8859-12 is nonexistent. Reserved for Indics?)
  Baltics       iso-8859-13     cp775   cp1257
  Celtics       iso-8859-14
  Latin9 [4]    iso-8859-15
  Latin10       iso-8859-16
  Vietnamese    viscii                  cp1258  MacVietnamese
  ----------------------------------------------------------------

  [1] Esperanto, Maltese, and Turkish. Turkish is now on 8859-9.
  [2] Baltics. Now on 8859-10, except for Latvian.
  [3] TIS 620 + Non-Breaking Space (0xA0 / U+00A0)
  [4] Nicknamed Latin0; the Euro sign as well as French and Finnish
      letters that are missing from 8859-1 were added.

All cp* are also available as ibm-*, ms-*, and windows-*. See also
L<http://czyborra.com/charsets/codepages.html>.

Macintosh encodings don't seem to be registered with bodies such as
IANA. "Canonical" names in Encode are based upon Apple's Tech Note
1150.
See L<http://developer.apple.com/technotes/tn/tn1150.html>
for details.

=item KOI8 - De Facto Standard for the Cyrillic world

Though ISO-8859 does have ISO-8859-5, the KOI8 series is far more
popular on the Net. L<Encode> comes with the following KOI charsets.
For gory details, see L<http://czyborra.com/charsets/cyrillic.html>.

  ----------------------------------------------------------------
  koi8-f
  koi8-r cp878                                           [RFC1489]
  koi8-u                                                 [RFC2319]
  ----------------------------------------------------------------

=back

=head2 gsm0338 - Hentai Latin 1

GSM0338 is for GSM handsets. Though it shares alphanumerals with
ASCII, control character ranges and other parts are mapped very
differently, mainly to store Greek characters. There are also escape
sequences (starting with 0x1B) to cover e.g. the Euro sign.

This was once handled by L<Encode::Byte> but because of all those
unusual specifications, Encode 2.20 has relocated the support to
L<Encode::GSM0338>. See L<Encode::GSM0338> for details.

=over 2

=item gsm0338 support before 2.19

Some special cases like a trailing 0x00 byte or a lone 0x1B byte are not
well-defined and decode() will return an empty string for them.
One possible workaround is

  $gsm =~ s/\x00\z/\x00\x00/;
  $uni = decode("gsm0338", $gsm);
  $uni .= "\xA0" if $gsm =~ /\x1B\z/;

Note that the Encode implementation of GSM0338 does not implement the
reuse of Latin capital letters as Greek capital letters (for example,
0x5A is U+005A (LATIN CAPITAL LETTER Z), not U+0396 (GREEK CAPITAL
LETTER ZETA)).

The GSM0338 is also covered in Encode::Byte even though it is not
an "extended ASCII" encoding.

=back

=head2 CJK: Chinese, Japanese, Korean (Multibyte)

Note that Vietnamese is listed above. Also read "Encoding vs Charset"
below.
Also note that these are implemented in distinct modules per
country because of size concerns (simplified Chinese is mapped
to 'CN', continental China, while traditional Chinese is mapped to
'TW', Taiwan). Please refer to their respective documentation pages.

=over 2

=item Encode::CN -- Continental China

    Standard      DOS/Win Macintosh                Comment/Reference
    ----------------------------------------------------------------
    euc-cn [1]            MacChineseSimp
    (gbk)         cp936 [2]
    gb12345-raw                            { GB12345 without CES }
    gb2312-raw                             { GB2312  without CES }
    hz
    iso-ir-165
    ----------------------------------------------------------------

    [1] GB2312 is aliased to this. See L<Microsoft-related naming mess>
    [2] gbk is aliased to this. See L<Microsoft-related naming mess>

=item Encode::JP -- Japan

    Standard      DOS/Win Macintosh                Comment/Reference
    ----------------------------------------------------------------
    euc-jp
    shiftjis      cp932   macJapanese
    7bit-jis
    iso-2022-jp                                          [RFC1468]
    iso-2022-jp-1                                        [RFC2237]
    jis0201-raw  { JIS X 0201 (roman + halfwidth kana) without CES }
    jis0208-raw  { JIS X 0208 (Kanji + fullwidth kana) without CES }
    jis0212-raw  { JIS X 0212 (Extended Kanji)         without CES }
    ----------------------------------------------------------------

=item Encode::KR -- Korea

    Standard      DOS/Win Macintosh                Comment/Reference
    ----------------------------------------------------------------
    euc-kr                MacKorean                        [RFC1557]
                  cp949 [1]
    iso-2022-kr                                            [RFC1557]
    johab                                  [KS X 1001:1998, Annex 3]
    ksc5601-raw                              { KSC5601 without CES }
    ----------------------------------------------------------------

    [1] ks_c_5601-1987, (x-)?windows-949, and uhc are aliased to this.
        See below.

=item Encode::TW -- Taiwan

    Standard      DOS/Win Macintosh                Comment/Reference
    ----------------------------------------------------------------
    big5-eten     cp950   MacChineseTrad {big5 aliased to big5-eten}
    big5-hkscs
    ----------------------------------------------------------------

=item Encode::HanExtra -- More Chinese via CPAN

Due to size concerns, the additional Chinese encodings below are
distributed separately on CPAN, under the name Encode::HanExtra.

    Standard      DOS/Win Macintosh                Comment/Reference
    ----------------------------------------------------------------
    big5ext                                   CMEX's Big5e Extension
    big5plus                                  CMEX's Big5+ Extension
    cccii         Chinese Character Code for Information Interchange
    euc-tw                                  EUC (Extended Unix Code)
    gb18030                          GBK with Traditional Characters
    ----------------------------------------------------------------

=item Encode::JIS2K -- JIS X 0213 encodings via CPAN

Due to size concerns, the additional Japanese encodings below are
distributed separately on CPAN, under the name Encode::JIS2K.

    Standard      DOS/Win Macintosh                Comment/Reference
    ----------------------------------------------------------------
    euc-jisx0213
    shiftjisx0213
    iso-2022-jp-3
    jis0213-1-raw
    jis0213-2-raw
    ----------------------------------------------------------------

=back

=head2 Miscellaneous encodings

=over 2

=item Encode::EBCDIC

See L<perlebcdic> for details.

  ----------------------------------------------------------------
  cp37
  cp500
  cp875
  cp1026
  cp1047
  posix-bc
  ----------------------------------------------------------------

=item Encode::Symbols

For symbols and dingbats.

  ----------------------------------------------------------------
  symbol
  dingbats
  MacDingbats
  AdobeZdingbat
  AdobeSymbol
  ----------------------------------------------------------------

=item Encode::MIME::Header

Strictly speaking, MIME header encoding documented in RFC 2047 is more
of an encapsulation than an encoding. However, support for it is
imperative in the modern world, so it is supported.

  ----------------------------------------------------------------
  MIME-Header                                            [RFC2047]
  MIME-B                                                 [RFC2047]
  MIME-Q                                                 [RFC2047]
  ----------------------------------------------------------------

=item Encode::Guess

This one is not the name of an encoding but a utility that lets you pick
the most appropriate encoding for given data out of given I<suspects>.
See L<Encode::Guess> for details.

=back

=head1 Unsupported encodings

The following encodings are not supported as yet; some because they
are rarely used, some because of technical difficulties. They may
be supported by external modules via CPAN in the future, however.

=over 2

=item ISO-2022-JP-2 [RFC1554]

Not very popular yet. Implementing encode() needs the Unicode Database
or an equivalent, because the encoding includes JIS X 0208/0212,
KSC5601, and GB2312 simultaneously, and their code points in Unicode
overlap. You need to look up the database to determine which character
set a given Unicode character should belong to.

=item ISO-2022-CN [RFC1922]

Not very popular. Needs CNS 11643-1 and -2 which are not available in
this module. CNS 11643 is supported (via euc-tw) in Encode::HanExtra.
Autrijus Tang may add support for this encoding in his module in future.

=item Various HP-UX encodings

The following are unsupported due to the lack of mapping data.

  '8'  - arabic8, greek8, hebrew8, kana8, thai8, and turkish8
  '15' - japanese15, korean15, and roi15

=item Cyrillic encoding ISO-IR-111

Anton Tagunov doubts its usefulness.

=item ISO-8859-8-1 [Hebrew]

None of the Encode team knows Hebrew well enough (ISO-8859-8, cp1255 and
MacHebrew are supported only because there were mappings
available at L<http://www.unicode.org/>). Contributions welcome.

=item ISIRI 3342, Iran System, ISIRI 2900 [Farsi]

Ditto.

=item Thai encoding TCVN

Ditto.

=item Vietnamese encodings VPS

Though Jungshik Shin has reported that Mozilla supports this encoding,
it was too late before 5.8.0 for us to add it. In the future, it
may be available via a separate module. See
L<http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.uf>
and
L<http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.ut>
if you are interested in helping us.

=item Various Mac encodings

The following are unsupported due to the lack of mapping data.

  MacArmenian,  MacBengali,   MacBurmese,   MacEthiopic
  MacExtArabic, MacGeorgian,  MacKannada,   MacKhmer
  MacLaotian,   MacMalayalam, MacMongolian, MacOriya
  MacSinhalese, MacTamil,     MacTelugu,    MacTibetan
  MacVietnamese

The rest which are already available are based upon the vendor mappings
at L<http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/>.

=item (Mac) Indic encodings

The maps for the following are available at L<http://www.unicode.org/>
but remain unsupported because those encodings need an algorithmic
approach, currently unsupported by F<enc2xs>:

  MacDevanagari
  MacGurmukhi
  MacGujarati

For details, please see C<Unicode mapping issues and notes:> at
L<http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/DEVANAGA.TXT>.

I believe this issue is prevalent not only for Mac Indics but also in
other Indic encodings, but the above were the only Indic encoding
maps that I could find at L<http://www.unicode.org/>.

=back

=head1 Encoding vs. Charset -- terminology

We are used to using the terms (character) I<encoding> and I<character
set> interchangeably. But just as confusing the terms byte and
character is dangerous and the terms should be differentiated when
needed, we need to differentiate I<encoding> and I<character set>.

To understand that, here is a description of how we make computers
grok our characters.

=over 2

=item *

First we start with which characters to include. We call this
collection of characters a I<character repertoire>.

=item *

Then we have to give each character a unique ID so your computer can
tell the difference between 'a' and 'A'. This itemized character
repertoire is now a I<character set>.

=item *

If your computer can grok the character set without further
processing, you can go ahead and use it. This is called a I<coded
character set> (CCS) or I<raw character encoding>. ASCII is used this
way in most cases.

=item *

But in many cases, especially multi-byte CJK encodings, you have to
tweak a little more. Your network connection may not accept any data
with the Most Significant Bit set, and your computer may not be able to
tell if a given byte is a whole character or just half of it. So you
have to I<encode> the character set to use it.

A I<character encoding scheme> (CES) determines how to encode a given
character set, or a set of multiple character sets. 7bit ISO-2022 is
an example of a CES. You switch between character sets via I<escape
sequences>.

=back

Technically, or mathematically, speaking, a character set encoded in
such a CES that maps character by character may form a CCS. EUC is such
an example. The CES of EUC is as follows:

=over 2

=item *

Map ASCII unchanged.

=item *

Map a character set that consists of 94**N or 96**N members (94 or 96
to the power of N) by adding 0x80 to each byte.

=item *

You can also use 0x8e and 0x8f to indicate that the following sequence of
characters belongs to yet another character set. The value 0x80 is
added to each following byte.

=back

By carefully looking at the encoded byte sequence, you can see that
each byte sequence corresponds to a unique number. In that sense, EUC
is a CCS generated by the CES above from up to four CCS (complicated?).
UTF-8 falls into this category. See L<perlunicode/"UTF-8"> to find out
how UTF-8 maps Unicode to a byte sequence.

You may also have found out by now why 7bit ISO-2022 cannot comprise
a CCS. If you look at a byte sequence \x21\x21, you can't tell if
it is two !'s or IDEOGRAPHIC SPACE. EUC maps the latter to \xA1\xA1
so you have no trouble differentiating between "!!" and S<" ">.

=head1 Encoding Classification (by Anton Tagunov and Dan Kogai)

This section tries to classify the supported encodings by their
applicability for information exchange over the Internet and to
choose the most suitable aliases to name them in the context of
such communication.

=over 2

=item *

To (en|de)code encodings marked by C<(**)>, you need
C<Encode::HanExtra>, available from CPAN.

=back

Encoding names

  US-ASCII    UTF-8    ISO-8859-*  KOI8-R
  Shift_JIS   EUC-JP   ISO-2022-JP ISO-2022-JP-1
  EUC-KR      Big5     GB2312

are registered with IANA as preferred MIME names and may
be used over the Internet.

C<Shift_JIS> has been made official by JIS X 0208:1997.
L<Microsoft-related naming mess> gives details.

C<GB2312> is the IANA name for C<EUC-CN>.
See L<Microsoft-related naming mess> for details.

C<GB_2312-80> I<raw> encoding is available as C<gb2312-raw>
with Encode. See L<Encode::CN> for details.

  EUC-CN
  KOI8-U        [RFC2319]

have not been registered with IANA (as of March 2002) but
seem to be supported by major web browsers.
The IANA name for C<EUC-CN> is C<GB2312>.

  KS_C_5601-1987

is heavily misused.
See L<Microsoft-related naming mess> for details.

C<KS_C_5601-1987> I<raw> encoding is available as C<ksc5601-raw>
with Encode. See L<Encode::KR> for details.

  UTF-16 UTF-16BE UTF-16LE

are IANA-registered C<charset>s. See [RFC 2781] for details.
Jungshik Shin reports that UTF-16 with a BOM is well accepted
by MS IE 5/6 and NS 4/6. Beware however that

=over 2

=item *

C<UTF-16> support in any software you're going to be
using/interoperating with has probably been less tested
than C<UTF-8> support

=item *

C<UTF-8> coded data seamlessly passes traditional
command piping (C<cat>, C<more>, etc.) while C<UTF-16> coded
data is likely to cause confusion (with its zero bytes,
for example)

=item *

it is beyond the power of words to describe the way HTML browsers
encode non-C<ASCII> form data. To get a general impression, visit
L<http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html>.
While encoding of form data has stabilized for C<UTF-8> encoded pages
(at least IE 5/6, NS 6, and Opera 6 behave consistently), be sure to
expect fun (and cross-browser discrepancies) with C<UTF-16> encoded
pages!

=back

The rule of thumb is to use C<UTF-8> unless you know what
you're doing and unless you really benefit from using C<UTF-16>.
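
The zero-byte point above can be seen with a minimal sketch like the
following. It assumes only the core C<Encode> module; the variable
names are illustrative and not part of any API.

```perl
use strict;
use warnings;
use Encode qw(encode);

# The same three ASCII characters in two encodings.
# (Variable names here are illustrative only.)
my $text    = "abc";
my $utf8    = encode("UTF-8",    $text);   # one byte per ASCII character
my $utf16be = encode("UTF-16BE", $text);   # two bytes per character, no BOM

# Show the raw bytes as hex strings.
print "UTF-8:    ", unpack("H*", $utf8),    "\n";
print "UTF-16BE: ", unpack("H*", $utf16be), "\n";

# Count the zero bytes that tend to confuse byte-oriented tools.
my $nul_count = () = $utf16be =~ /\x00/g;
print "NUL bytes in UTF-16BE: $nul_count\n";
```

For pure ASCII input the UTF-8 bytes are identical to the input, while
every other byte of the UTF-16BE form is zero, which is what tends to
trip up tools that treat data as NUL-terminated strings.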

  ISO-IR-165    [RFC1345]
  VISCII
  GB 12345
  GB 18030 (**)  (see links below)
  EUC-TW   (**)

are totally valid encodings but not registered at IANA.
The names under which they are listed here are probably the
most widely-known names for these encodings and are recommended
names.

  BIG5PLUS (**)

is a proprietary name.

=head2 Microsoft-related naming mess

Microsoft products misuse the following names:

=over 2

=item KS_C_5601-1987

Microsoft extension to C<EUC-KR>.

Proper names: C<CP949>, C<UHC>, C<x-windows-949> (as used by Mozilla).

See L<http://lists.w3.org/Archives/Public/ietf-charsets/2001AprJun/0033.html>
for details.

Encode aliases C<KS_C_5601-1987> to C<cp949> to reflect this common
misusage. I<Raw> C<KS_C_5601-1987> encoding is available as
C<ksc5601-raw>.

See L<Encode::KR> for details.

=item GB2312

Microsoft extension to C<EUC-CN>.

Proper names: C<CP936>, C<GBK>.

C<GB2312> has been registered in the C<EUC-CN> meaning at
IANA. This has partially repaired the situation: Microsoft's
C<GB2312> has become a superset of the official C<GB2312>.

Encode aliases C<GB2312> to C<euc-cn> in full agreement with
IANA registration. C<cp936> is supported separately.
I<Raw> C<GB_2312-80> encoding is available as C<gb2312-raw>.

See L<Encode::CN> for details.

=item Big5

Microsoft extension to C<Big5>.

Proper name: C<CP950>.

Encode separately supports C<Big5> and C<cp950>.

=item Shift_JIS

Microsoft's understanding of C<Shift_JIS>.

JIS has not endorsed the full Microsoft standard however.
The official C<Shift_JIS> includes only JIS X 0201 and JIS X 0208
character sets, while Microsoft has always used C<Shift_JIS>
to encode a wider character repertoire. See C<IANA> registration for
C<Windows-31J>.

As a historical predecessor, Microsoft's variant
probably has a stronger claim to the name, though it may be objected
that Microsoft shouldn't have used JIS as part of the name
in the first place.

Unambiguous name: C<CP932>. C<IANA> name (also used by Mozilla, and
provided as an alias by Encode): C<Windows-31J>.

Encode separately supports C<Shift_JIS> and C<cp932>.

=back

=head1 Glossary

=over 2

=item character repertoire

A collection of unique characters. A I<character> set in the strictest
sense. At this stage, characters are not numbered.

=item coded character set (CCS)

A character set that is mapped in a way computers can use directly.
Many character encodings, including EUC, fall in this category.

=item character encoding scheme (CES)

An algorithm to map a character set to a byte sequence. You don't
have to be able to tell which character set a given byte sequence
belongs to. 7-bit ISO-2022 is a CES but it cannot be a CCS. EUC is an
example of being both a CCS and CES.

=item charset (in MIME context)

has long been used in the meaning of C<encoding>, CES.

While the word combination C<character set> has lost this meaning
in MIME context since [RFC 2130], the C<charset> abbreviation has
retained it. This is how [RFC 2277] and [RFC 2278] bless C<charset>:

  This document uses the term "charset" to mean a set of rules for
  mapping from a sequence of octets to a sequence of characters, such
  as the combination of a coded character set and a character encoding
  scheme; this is also what is used as an identifier in MIME "charset="
  parameters, and registered in the IANA charset registry ...  (Note
  that this is NOT a term used by other standards bodies, such as ISO).
  [RFC 2277]

=item EUC

Extended Unix Code. See ISO-2022.

=item ISO-2022

A CES that was carefully designed to coexist with ASCII. There is a
7-bit version and an 8-bit version.

The 7-bit version switches character sets via escape sequences so it
cannot form a CCS. Since this is more difficult to handle in programs
than the 8-bit version, the 7-bit version is not very popular except for
iso-2022-jp, the I<de facto> standard CES for e-mails.

The 8-bit version can form a CCS. EUC and ISO-8859 are two examples
thereof. Pre-5.6 perl could use them as string literals.

=item UCS

Short for I<Universal Character Set>. When you say just UCS, it means
I<Unicode>.

=item UCS-2

ISO/IEC 10646 encoding form: Universal Character Set coded in two
octets.

=item Unicode

A character set that aims to include all character repertoires of the
world. Many character sets in various national as well as industrial
standards have become, in a way, just subsets of Unicode.

=item UTF

Short for I<Unicode Transformation Format>. Determines how to map a
Unicode character into a byte sequence.

=item UTF-16

A UTF in 16-bit encoding. Can be either big endian or little
endian. The big endian version is called UTF-16BE (equal to UCS-2 +
surrogate support) and the little endian version is called UTF-16LE.

=back

=head1 See Also

L<Encode>,
L<Encode::Byte>,
L<Encode::CN>, L<Encode::JP>, L<Encode::KR>, L<Encode::TW>,
L<Encode::EBCDIC>, L<Encode::Symbol>,
L<Encode::MIME::Header>, L<Encode::Guess>

=head1 References

=over 2

=item ECMA

European Computer Manufacturers Association
L<http://www.ecma.ch>

=over 2

=item ECMA-035 (eq C<ISO-2022>)

L<http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM>

The specification of ISO-2022 is available from the link above.

=back

=item IANA

Internet Assigned Numbers Authority
L<http://www.iana.org/>

=over 2

=item Assigned Charset Names by IANA

L<http://www.iana.org/assignments/character-sets>

Most of the C<canonical names> in Encode derive from this list
so you can directly apply the string you have extracted from MIME
header of mails and web pages.

=back

=item ISO

International Organization for Standardization
L<http://www.iso.ch/>

=item RFC

Request For Comments -- need I say more?
L<http://www.rfc-editor.org/>, L<http://www.rfc.net/>,
L<http://www.faqs.org/rfcs/>

=item UC

Unicode Consortium
L<http://www.unicode.org/>

=over 2

=item Unicode Glossary

L<http://www.unicode.org/glossary/>

The glossary of this document is based upon this site.

=back

=back

=head2 Other Notable Sites

=over 2

=item czyborra.com

L<http://czyborra.com/>

Contains a lot of useful information, especially gory details of ISO
vs. vendor mappings.

=item CJK.inf

L<http://www.oreilly.com/people/authors/lunde/cjk_inf.html>

Somewhat obsolete (last update in 1996), but still useful. Also try

L<ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf>

You will find brief info on C<EUC-CN>, C<GBK> and mostly on C<GB 18030>.

=item Jungshik Shin's Hangul FAQ

L<http://jshin.net/faq>

And especially its subject 8.

L<http://jshin.net/faq/qa8.html>

A comprehensive overview of the Korean (C<KS *>) standards.

=item debian.org: "Introduction to i18n"

A brief description for most of the mentioned CJK encodings is
contained in
L<http://www.debian.org/doc/manuals/intro-i18n/ch-codes.en.html>

=back

=head2 Offline sources

=over 2

=item C<CJKV Information Processing> by Ken Lunde

CJKV Information Processing
1999 O'Reilly & Associates, ISBN : 1-56592-224-7

The modern successor of C<CJK.inf>.

Features a comprehensive coverage of CJKV character sets and
encodings along with many other issues faced by anyone trying
to better support CJKV languages/scripts in all the areas of
information processing.

To purchase this book, visit
L<http://www.oreilly.com/catalog/cjkvinfo/>
or your favourite bookstore.

=back

=cut
Generated: Tue Mar 17 22:47:18 2015 | Cross-referenced by PHPXref 0.7.1 |