A full locale description consists of three parts: xx_YY.ZZZZ.
-
xx: ISO 639 language codes (lower case)
-
YY: ISO 3166 country codes (upper case)
-
ZZZZ: codeset, i.e., character set or encoding identifier.
For language codes and country codes, see the pertinent description in
info gettext.
Please note that the codeset part may be normalized internally, by removing
every - and converting all characters to lower case, to achieve
cross-platform compatibility. Typical codesets are:
-
UTF-8: Unicode for all regions, mostly in 1-3 Octets (new de facto standard)
-
ISO-8859-1: western Europe (de facto old standard)
-
ISO-8859-2: eastern Europe (Bosnian, Croatian, Czech, Hungarian, Polish, Romanian, Serbian, Slovak, Slovenian)
-
ISO-8859-3: Maltese
-
ISO-8859-5: Macedonian, Serbian
-
ISO-8859-6: Arabic
-
ISO-8859-7: Greek
-
ISO-8859-8: Hebrew
-
ISO-8859-9: Turkish
-
ISO-8859-11: Thai (=TIS-620)
-
ISO-8859-13: Latvian, Lithuanian, Maori
-
ISO-8859-14: Welsh
-
ISO-8859-15: western Europe with euro
-
KOI8-R: Russian
-
KOI8-U: Ukrainian
-
CP1250: Czech, Hungarian, Polish (MS Windows origin)
-
CP1251: Bulgarian, Belarusian (MS Windows origin)
-
eucJP: Unix style Japanese (=ujis)
-
eucKR: Unix style Korean
-
GB2312: Unix style Simplified Chinese (=GB, =eucCN) for zh_CN
-
Big5: Traditional Chinese for zh_TW
-
sjis: Microsoft style Japanese (Shift-JIS)
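The xx_YY.ZZZZ layout and the codeset normalization described above can be sketched in a few lines of Python. This is only an illustration: the names split_locale, normalize_codeset, and LOCALE_RE are hypothetical, the pattern is an assumption about what a well-formed name looks like, and the real normalization happens inside the C library, not in code like this.

```python
import re

# Hypothetical pattern for the xx_YY.ZZZZ form; the country and codeset
# parts are treated as optional here (an assumption for this sketch).
LOCALE_RE = re.compile(
    r"^(?P<lang>[a-z]{2,3})(?:_(?P<country>[A-Z]{2}))?(?:\.(?P<codeset>.+))?$"
)


def split_locale(name):
    """Split a locale name into (language, country, codeset)."""
    m = LOCALE_RE.match(name)
    if m is None:
        raise ValueError("not in xx_YY.ZZZZ form: %r" % name)
    return m.group("lang"), m.group("country"), m.group("codeset")


def normalize_codeset(codeset):
    """Remove every '-' and convert to lower case, as described above."""
    return codeset.replace("-", "").lower()


print(split_locale("ja_JP.eucJP"))       # ('ja', 'JP', 'eucJP')
print(normalize_codeset("ISO-8859-1"))   # iso88591
print(normalize_codeset("UTF-8") == normalize_codeset("utf8"))  # True
```

With this normalization, spellings such as UTF-8 and utf8 compare equal, which is the point of the cross-platform rule.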
The basic encoding terms have the following meanings:
-
ASCII: 7 bits (0-0x7f)
-
ISO-8859-?: 8 bits (0-0xff)
-
ISO-10646-1: Universal Character Set (UCS) (31 bits, 0-0x7fffffff)
-
UCS-2: First 16 bits of UCS as straight 2 Octets (Unicode: 0-0xffff)
-
UCS-4: UCS as straight 4 Octets (UCS: 0-0x7fffffff)
-
UTF-8: UCS encoded in 1-6 Octets (mostly in 1-3 Octets)
-
ISO-2022: 7 bits (0-0x7f) with escape sequences. ISO-2022-JP is the most popular encoding for Japanese e-mail.
-
EUC: 8 bits + 16 bits combination (0-0xff), Unix style
-
Shift-JIS: 8 bits + 16 bits combination (0-0xff), Microsoft style.
ISO-8859-?, EUC, ISO-10646-1, UCS-2, UCS-4, and UTF-8 all assign the same
code values as ASCII to the 7-bit characters. EUC and Shift-JIS use high-bit
bytes (0x80-0xff) to indicate that part of the encoding is 16 bits wide.
UTF-8 also uses high-bit bytes (0x80-0xff), in its case for every byte of a
non-7-bit character sequence, and it is the sanest encoding system for
handling non-ASCII characters.
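The high-bit behaviour described above can be observed with Python's built-in codecs. This is a small sketch, not part of the original text; the sample strings are arbitrary, and the codec names utf-8, iso-8859-1, and euc_jp are the spellings Python uses.

```python
text = "a\u00e9"  # 'a' (ASCII) followed by e-acute (non-ASCII)

utf8 = text.encode("utf-8")
latin1 = text.encode("iso-8859-1")

# The ASCII character keeps its 7-bit value in both encodings.
print(utf8.hex(" "))    # 61 c3 a9
print(latin1.hex(" "))  # 61 e9

# In UTF-8 every byte of a non-ASCII character has the high bit set.
print(all(b >= 0x80 for b in utf8[1:]))  # True

# EUC-JP marks a 16-bit character with two high-bit bytes.
hiragana_a = "\u3042".encode("euc_jp")
print(hiragana_a.hex(" "))  # a4 a2
```

Because no byte below 0x80 ever appears inside a multi-byte UTF-8 sequence, tools that only understand ASCII delimiters can still split UTF-8 text safely.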
Please note the difference in byte order between Unicode implementations:
-
Standard UCS-2, UCS-4: big endian
-
Microsoft UCS-2, UCS-4: little endian for ix86 (machine-dependent)
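The byte-order difference listed above can be made visible with a short Python sketch. Python has no codec literally named UCS-2 or UCS-4, so the utf-16 and utf-32 codecs are used here as stand-ins; that substitution is an assumption of this example, and it yields the same bytes for a character such as "A".

```python
import sys

ch = "A"  # U+0041

# 2-octet form: big endian puts the high-order octet first.
print(ch.encode("utf-16-be").hex(" "))  # 00 41
print(ch.encode("utf-16-le").hex(" "))  # 41 00

# 4-octet form, same idea.
print(ch.encode("utf-32-be").hex(" "))  # 00 00 00 41
print(ch.encode("utf-32-le").hex(" "))  # 41 00 00 00

# Native byte order of the machine running this script.
print(sys.byteorder)
```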
See Convert a text file with recode, Section 8.6.12, for conversion between
various character sets. For more, see Introduction to i18n.
Activating a particular locale
The following environment variables are evaluated in this order to provide
particular locale values to programs:
-
LANGUAGE: This environment variable consists of a colon-separated list
of locale names in order of priority. It is used only if the POSIX locale
is set to a value other than "C" [in Woody; the Potato version always has
priority over the POSIX locale]. (GNU extension)
-
LC_ALL: If this is non-null, the value is used for all locale
categories. (POSIX.1) Usually "" (null).
-
LC_*: If this is non-null, the value is used for the corresponding
category (POSIX.1). Usually "C".
LC_* variables are:
-
LC_CTYPE: Character classification and case conversion.
-
LC_COLLATE: Collation order.
-
LC_TIME: Date and time formats.
-
LC_NUMERIC: Non-monetary numeric formats.
-
LC_MONETARY: Monetary formats.
-
LC_MESSAGES: Formats of informative and diagnostic messages and interactive responses.
-
LC_PAPER: Paper size.
-
LC_NAME: Name formats.
-
LC_ADDRESS: Address formats and location information.
-
LC_TELEPHONE: Telephone number formats.
-
LC_MEASUREMENT: Measurement units (Metric or Other).
-
LC_IDENTIFICATION: Metadata about the locale information.
-
LANG: If this is non-null and LC_ALL is undefined, the value is used for
all LC_* locale categories with undefined values. (POSIX.1) Usually "C".
Note that some applications (e.g., Netscape 4) ignore LC_* settings.
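The evaluation order given above can be modelled as a small lookup. The following Python sketch only mirrors the rules as stated in this section; effective_locale and message_languages are hypothetical helper names, the environment is passed in as a plain dict, and falling back to "C" when nothing is set is an assumption of this sketch.

```python
def effective_locale(category, env):
    """Return the locale for one LC_* category: LC_ALL, then LC_*, then LANG."""
    for name in ("LC_ALL", category, "LANG"):
        value = env.get(name, "")
        if value:  # non-null wins
            return value
    return "C"  # assumed default when nothing is set


def message_languages(env):
    """Return the priority list of languages used for messages.

    LANGUAGE is honoured only when the POSIX locale is not "C".
    """
    base = effective_locale("LC_MESSAGES", env)
    language = env.get("LANGUAGE", "")
    if base != "C" and language:
        return [item for item in language.split(":") if item]
    return [base]


env = {"LANG": "en_US.UTF-8", "LC_TIME": "de_DE.UTF-8", "LANGUAGE": "fr:de"}
print(effective_locale("LC_TIME", env))     # de_DE.UTF-8
print(effective_locale("LC_COLLATE", env))  # en_US.UTF-8
print(message_languages(env))               # ['fr', 'de']

env["LC_ALL"] = "C"
print(effective_locale("LC_TIME", env))     # C
print(message_languages(env))               # ['C']
```

Passing os.environ instead of the sample dict would apply the same lookup to the current process, though programs that ignore LC_* settings will of course not follow it.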
The locale program can display active locale settings and available locales;
see locale(1). (NOTE: locale -a lists all the locales that your system knows
about; this does not mean that all of them are compiled! See Activating
locale support, Section 9.7.4.)