Unicode

The Linux Chinese HOWTO is quite outdated. As of 2002.10.29, the latest version is v1.04, dated 2 June 1998.

http://linuxselfhelp.com/HOWTO/Chinese-HOWTO.html http://www.ibiblio.org/pub/Linux/docs/HOWTO/Chinese-HOWTO

UTF-8 and Unicode FAQ for Unix/Linux, by Markus Kuhn. Created 1999-06-04, last modified 2002-09-03, as of 2002.10.29.

http://www.cl.cam.ac.uk/~mgk25/unicode.html

UTF-8 Sampler

Sampler UTF-8 web page by the Kermit project http://www.columbia.edu/kermit/utf8.html

Most recent update: Thu Oct 17 09:32:50 2002, as of 2002.10.29.

The Unicode HOWTO: v1.0, 23 January 2001, as of 2002.10.29.

This document describes how to change your Linux system so it uses UTF-8 as text encoding.

http://www.linuxdoc.org/HOWTO/Unicode-HOWTO.html http://www.tldp.org/HOWTO/Unicode-HOWTO.html

http://rf.net/~james/perli18n.html

Link http://rf.net/~james/perli18n.html Date 2002 02 18

- Q0. Do you have a checklist for internationalizing an application?
- Q1. I think that I'm a clever programmer. What's so hard about internationalization?
- Q2. Do you have a glossary of commonly used terms and acronyms?

Perl and locales, Unicode, porting, modules and CPAN

- Q3: What locale support does Perl have?
- Q4. What support does Perl have for Unicode?
- Q5. How do operating systems implement Unicode and i18N?
- Q6. I'm a Perl Porter. What should I know about i18N and C?
- Q7. I'm a Perl Porter. What should I know about Perl and Unicode?
- Q8. I'm a CPAN module author. What should I know about Perl and Unicode?
- Q9. Do regular expressions work with locales?
- Q10. Do regular expressions work with Unicode?
- Q11. What are these CPAN Unicode modules for?
- Q11b. What about i18N POD?
- Q12. What is JPerl?

More General Unicode and Programming Information

- Q13. Can I just do nothing and let my program be agnostic of character set?
- Q14. Why and where should I use Unicode instead of native encodings?
- Q15. What is Unicode normalization and why is it important?
- Q16. How do I do auto-detection of Unicode streams?
- Q17. Is Unicode big endian or little endian?
- Q18. Is there an EBCDIC-safe transformation of Unicode?
- Q19. Are there security implications in i18N?
- Q20. Are there performance issues in i18N?
- Q21. How do I localize strings in my program?
- Q22. I do database programming with Perl. Can I use Unicode?
- Q23. I do database programming with Perl. What are the i18N issues?
- Q24. How do other programming languages implement Unicode and i18N?

Internationalized Web Programming

- Q25. What support for Unicode do web browsers have?
- Q26. How can I i18N my web pages and CGI programs?
- Q27. How should I structure my web server directories for international content?
- Q28. Can web servers automatically detect the language of the browser?
- Q29. What format do I send strings to the translator?

Internationalized Email Programming

Q30. What are common encodings for email?

iDNS

Q30b. What is happening with internationalized DNS?

Timezones

Q30c. How can I manage timezones in Perl?

References

Q31. Any good references?

Perl Hacks

- Q32. How do I convert US-ASCII to UTF-16 on Windows NT?
- Q33. How do I transform the name of a character encoding to the MIME charset name?

http://people.netscape.com/ftang/i18n.html#detect

UTF-8
- Is This File UTF8? isutf8.pl
- IsUTF8 in C
- An improved version of IsUTF8 in Mozilla source IsUTFText
Cyrillic Charset Detection - http://www.neystadt.org/cyrillic/Lingua-DetectCharset.htm
Chinese Charset Detection
- Chih-Hao Tsai's Frequency and Stroke Counts of Chinese Characters
- SinoDetect by Erik Peterson <eepeter@erols.com>
  
  http://www.mandarintools.com/download/SinoDetect.h http://www.mandarintools.com/download/SinoDetect.cpp http://www.mandarintools.com/download/detecttest.cpp
- Justin Yu's algorithm - http://www.ihep.ac.cn/~yumj/www/chrecog.html
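
The UTF-8 checkers listed above (isutf8.pl, IsUTF8 in C) are not reproduced here. As a rough stand-in, and assuming `iconv` is installed, a file can be tested by asking iconv to convert it from UTF-8 to UTF-8 and looking at the exit status. The function name `is_utf8` and the temporary file names are made up for this sketch.

```shell
# is_utf8 FILE: succeed if FILE decodes cleanly as UTF-8.
# Relies on iconv exiting non-zero when it meets an invalid byte sequence.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# Demo on two throwaway files (names are arbitrary).
tmpdir=$(mktemp -d)
printf '\344\270\255\346\226\207\n' > "$tmpdir/good.txt"   # two Chinese characters in UTF-8
printf '\326\320\316\304\n'         > "$tmpdir/bad.txt"    # the same two characters in GB2312

for f in "$tmpdir/good.txt" "$tmpdir/bad.txt"; do
    if is_utf8 "$f"; then
        echo "$f: valid UTF-8"
    else
        echo "$f: not UTF-8"
    fi
done
rm -rf "$tmpdir"
```

Note that a pure ASCII file also passes, and passing only means the bytes are well-formed UTF-8, not that the author intended UTF-8.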

http://www.xfree86.org/pipermail/i18n/2001-March/001379.html

> Also, the new xterm support -u8, there is ja and ko in ISO 10646-1,
> but how to make ISO 10646-1 font works with Chinese?

The ja and ko fonts also contain all glyphs from the commonly used Chinese character sets. They just prefer glyphs from Japanese or Korean character sets where multiple glyphs were available for a single Unicode position. The ideographic *-ISO10646-1 fonts from the ucs-fonts-asian package are a bit of an experimental nature and comments would be very appreciated. I can easily generate a cn version as well using the same software that we used to merge the 18x18 ja and ko fonts. Just suggest a priority order of existing fonts that you would prefer to see merged into a cn font. See the .changes files in the ucs-fonts-asian for documentation on how these fonts were generated.

http://www.cl.cam.ac.uk/~mgk25/ucs-fonts.html http://www.cl.cam.ac.uk/~mgk25/download/ucs-fonts-asian.tar.gz

With regard to UTF-8 support for xterm:

We seem unfortunately to be heading towards a version split. XFree86 has long ago started its own very actively maintained development thread of xterm, managed by Thomas Dickey

http://dickey.his.com/xterm/ ftp://dickey.his.com/xterm/

with various extensions by Robert Brady:

http://www.zepler.org/~rwb197/xterm/

The semantics and design ideas behind this xterm version are summarized for example on

http://www.cl.cam.ac.uk/~mgk25/unicode.html#xterm

in particular considering the behaviour with regard to choosing between normal and wide characters and handling combining characters. This xterm has deliberately (temporarily) hardwired-in support for

- a UTF-8 encoder/decoder
- a Unicode-specific wcwidth function
- a Unicode-specific Normalization Form C mapper

in order to guarantee portable usability even on platforms without UTF-8 locale support.

On the other hand, there is now an independent new and not yet widely used Li18nux/X.Org patch for xterm available that is more based on the i18n mechanics of X11 (X Output Methods in particular) that was originally introduced to accommodate national CJK encodings and the suitability of which for Unicode support is still a somewhat controversial topic.

http://www.li18nux.org/subgroups/utildev/dli18npatch.html

How suitable it is in practice for UTF-8 usage (especially considering the large number of practical detail issues that have been discussed on the linux-utf8 mailing list during the past year) will have to be tested thoroughly first.

I'm somewhat disappointed that this second xterm development thread by Li18nux/X.Org was never properly announced/advertised to XFree86 xterm developers here. There still seem to be disappointing communication problems.

Markus G. Kuhn

LANG=zh_CN xterm -u8 -fn -misc-fixed-medium-r-semicondensed--0-0-75-75-c-0-iso10646-1 -e luit &

or,

LANG=zh_CN xterm -u8 -fn -misc-zysong18030-medium-r-normal--0-0-0-0-p-0-iso10646-1 -e luit &

LANG=zh_CN xterm -u8 -fn '-arphic technology co.-ar pl kaitim gb-medium-r-normal--0-0-0-0-p-0-iso10646-1' -e luit &
LANG=zh_CN xterm -u8 -fn '-arphic technology co.-ar pl kaitim big5-medium-r-normal--0-0-0-0-c-0-iso10646-1' -e luit &

then

date

Comments

misc-zysong18030-…-c-0-iso10646-1 has (almost?) all characters, but the English characters take up double space. Awful!

Trying History

No effect

xterm is still normal size.

LANG=zh_TW xterm -u8 -fn -misc-fixed-medium-r-semicondensed--0-0-75-75-c-0-iso10646-1 -e luit &

— couldn't find charset data for locale zh_TW; using ISO 8859-1.

No Chinese characters shown

xterm is double width.

LANG=zh_TW xterm -u8 -fn '-arphic technology co.-ar pl mingti2l big5-medium-r-normal--0-0-0-0-c-0-iso10646-1' -e luit &
LANG=zh_TW xterm -u8 -fn '-arphic technology co.-ar pl kaitim big5-medium-r-normal--0-0-0-0-c-0-iso10646-1' -e luit &

— couldn't find charset data for locale zh_TW; using ISO 8859-1.

No characters shown at all

xterm is double width.

LANG=zh_TW.Big5 xterm -u8 -fn -taipei-fixed-medium-r-normal--0-0-75-75-c-0-big5-0 -e luit &

http://www.debian.or.jp/~kubota/xterm.html

I am working on internationalization (i18n)-related improvement of XTerm, which is included in the distribution of XFree86 and is the most widely used terminal emulator on X Window System in the world.

Status as of 2004.03.09

(2002-09-15) Though internationalization (i.e. LC_CTYPE locale sensibility) has almost finished on 2002-08-17 patch, automatic font selection was not implemented. This means, when XTerm automatically uses UTF-8 mode (luit-using locale-sensible mode also uses UTF-8 mode internally), *-iso10646-1 fonts should be used automatically instead of 8bit fonts.

Status as of 2002.10.29

(2002-08-17) My 2002-07-18 patch was integrated into CVS repository of XFree86. Now you can use locale-sensibility without any of my patches. We now will use various encodings by XTerm! By improving luit, XTerm will support more encodings. (For example, TCVN, GBK, and Shift_JIS will be supported by using 2002-07-04 patch).

Internationalization (i.e. LC_CTYPE locale sensibility) was almost finished in the 2002-08-17 patch; automatic font selection was implemented (patched) on 2002-09-15.

The download comes in two parts: XTerm (CVS 20020817) and the font patch. http://www.debian.or.jp/~kubota/softwares/xterm-20020817.tar.gz http://www.debian.or.jp/~kubota/softwares/xterm-20020918-ufont.diff.gz

Much Older Information

My work is based on:

the original XTerm by Thomas Dickey and
fine patch by Robert Brady.

Build 20020918 version

prepare

rpm -ih XFree86-devel-4.2.0-72.i386.rpm

cd /usr/X11R6/lib/
ln -s libXaw.so.7.0 libXaw.so
ln -s libXmu.so.6.2 libXmu.so

cd /usr/local/lib
ln -s /usr/X11R6/lib/libXaw* /usr/X11R6/lib/libXmu* .

compile

cd somewhere
tar -xvzf ../xterm-20020817.tar.gz
cd xterm-20020817/

cp ~/xterm-20020918-ufont.diff.gz .
gunzip xterm-20020918-ufont.diff.gz

patch -p1 < xterm-20020918-ufont.diff

chmod 755 configure
./configure --enable-256-color --enable-logging --enable-tcap-query --enable-luit --enable-wide-chars --enable-warnings

make
gcc -g -O2 -W -Wall -Wbad-function-cast -Wcast-align -Wcast-qual -DXTSTRINGDEFINES -Winline -Wmissing-declarations -Wmissing-prototypes -Wnested-externs -Wpointer-arith -Wshadow -Wstrict-prototypes -L/usr/X11R6/lib -o xterm button.o charproc.o charsets.o cursor.o data.o doublechr.o fontutils.o input.o main.o menu.o misc.o print.o ptydata.o screen.o scrollbar.o tabs.o util.o xstrings.o VTPrsTbl.o TekPrsTbl.o Tekproc.o charclass.o precompose.o wcwidth.o -L/usr/X11R6/lib  -lXaw -lXmu -lXext -lXt  -lSM -lICE -lX11 -lnsl /lib/libtermcap.so.2.0.8
make

For install

make install
make install-ti

mkdir /tmp/xterm-i18n
make -n uninstall | sed 's/^rm -f/ echo/' | sh | cpio -vpdm /tmp/xterm-i18n
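
The pipeline above depends on `make -n uninstall` printing one `rm -f FILE` line per installed file; sed turns each into ` echo FILE`, sh runs it, and the resulting file list is fed to cpio in pass-through mode. The rewrite step can be tried on its own with canned input. This is a sketch only: the two paths and the function names are hypothetical stand-ins, not the real output of the xterm Makefile.

```shell
# Stand-in for the output of `make -n uninstall` (paths are hypothetical).
dry_run_output() {
    printf '%s\n' \
        'rm -f /usr/X11R6/bin/xterm' \
        'rm -f /usr/X11R6/man/man1/xterm.1'
}

# Rewrite each "rm -f FILE" line into " echo FILE" and run it,
# so the file name is printed instead of the file being removed.
list_installed_files() {
    sed 's/^rm -f/ echo/' | sh
}

dry_run_output | list_installed_files
```

Any line that does not start with `rm -f` is passed to sh unchanged, so it is worth looking at the dry-run output before piping it through.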

Test

luit -list
LANG=zh_CN LC_CTYPE=zh_CN xterm &
date

Help

-lc: Turn on support of various encodings according to users' LC_CTYPE locale setting, i.e., LC_ALL, LC_CTYPE, or LANG variables. This is achieved by turning on UTF-8 mode and by invoking luit for conversion between locale encodings and UTF-8. (luit is not invoked in UTF-8 locales.) All you need is an iso10646-1 font regardless of your locale and encoding. This corresponds to the locale resource.

The actual list of encodings which are supported is determined by luit. Consult the luit manual page for further details.
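
The kind of conversion described above, between a locale encoding and UTF-8, can be tried outside xterm with iconv. This is only an illustration of the idea, not how luit is implemented, and it assumes the local iconv knows the GB2312 charset. The function name `gb2312_to_utf8` is made up for this sketch.

```shell
# gb2312_to_utf8: filter that reads GB2312 bytes on stdin
# and writes the same text as UTF-8 on stdout.
gb2312_to_utf8() {
    iconv -f GB2312 -t UTF-8
}

# Two Chinese characters encoded in GB2312 are the four bytes D6 D0 CE C4
# (written in octal here so that any POSIX printf accepts them).
printf '\326\320\316\304\n' | gb2312_to_utf8
```

On a UTF-8 terminal the last line should display the two characters; on a terminal in another encoding the output will look garbled even though the conversion itself succeeded.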

Test History

Not working:

LC_CTYPE=zh_CN.GB18030 xterm &
LANG=zh_CN.GB18030 LC_CTYPE=zh_CN.GB18030 xterm &
LC_CTYPE=zh_CN.GB18030 xterm &
LANG=zh_CN LC_CTYPE=zh_CN xterm &
LANG=zh_CN xterm -lc &
LANG=zh_CN xterm -lc -u8 -e luit &
LANG=zh_CN xterm -u8 -e luit
LANG=GB2312 xterm -u8 -e luit
LANG='GB 2312' xterm -u8 -e luit
xterm -u8 -e luit -g2 'GB 2312'
LANG=zh_CN xterm -u8 -e luit -g2 'GB 2312'
xterm -u8 -e luit -g2 'GB 2312'

Showing Chinese under XTerm

xterm -u8 -fn -misc-fixed-medium-r-semicondensed--0-0-75-75-c-0-iso10646-1 -e luit -g2 'GB 2312' &

then

LANG=zh_CN date

or,

export LANG=zh_CN

Using the luit trick, it worked fine, but a great many characters were missing. I was viewing a Chinese frequency list (i.e. most common characters at the beginning, least common at the end) and many very early ones were missing. But at least the whole mechanism seems to work.

Q: Which Chinese font does luit look for when doing the translation, and where?

because

xfd -fn -misc-fixed-medium-r-semicondensed--0-0-75-75-c-0-iso10646-1 &

shows no Chinese glyphs.

This also works!

xterm -u8 -fn -misc-zysong18030-medium-r-normal--0-0-0-0-c-0-iso10646-1 -e luit -g2 'GB 2312'

Invoking the above "working" command with LANG=zh_CN set makes it stop working.

The simsun font does not work. Tried but failed:

xterm -u8 -fn -microsoft-simsun-medium-r-normal--0-0-0-0-c-0-gb18030-0 -e luit -g2 'GB 2312' &

Black screen, no characters shown (though you know they are there), big cursor.

Bitmap fonts do not work either:

xterm -u8 -fn '-isas-fangsong ti-medium-r-normal--0-0-72-72-c-0-gb2312.1980-0' -e luit -g2 'GB 2312' &

The result is almost identical to that of the MS TrueType font above.

The above tests and results were reproduced and verified on RH8 (2003.10.27 Mon), without any changes to the current XFree86 and xterm. Even direct loading works:

LANG=zh_CN xterm -u8 -fn -misc-fixed-medium-r-semicondensed--0-0-75-75-c-0-iso10646-1 -e luit -g2 'GB 2312' &

Conclusion

It is hardly usable.

misc-fixed-medium-…-iso10646-1 is missing a great many characters, but it looks good.
misc-zysong18030-…-c-0-iso10646-1 has (almost?) all characters, but it looks really bad. The English characters also take up double space. Awful!
More importantly, I can no longer view "Chinese" in xterm. All those familiar Chinese "luanma" (garbled characters) are shown as blanks now.

So the rxvt solution I already have is enough, and it is almost perfect. Besides, rxvt also supports XIM. No need to explore any further.

documented on: 2004.03.09