To: linux-utf8@nl.linux.org Date: Mon, 2 Dec 2002 19:53:09 -0500 (EST)
cpdetector - code page detector
http://cpdetector.sourceforge.net/
cpdetector is a framework for configurable code page detection of documents. It may be used to detect the code page of documents retrieved from remote hosts.
Code page detection is needed whenever a document's encoding is not known. It is therefore a core requirement for any application in the field of information mining or information retrieval.
Even today, when Unicode unifies the different character encodings by assigning a unique number (code point) to every character, documents on the Internet are encoded in many different code pages. Asian documents in particular draw on a very large number of characters and are therefore often encoded in language-specific code pages. To process a text document, its bits have to be mapped (decoded) to characters using the correct character encoding table (code page).
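To make the mapping step concrete, here is a minimal Python sketch. It is not cpdetector's API or algorithm (cpdetector is a separate framework); the sample text, the candidate list, and the helper name `guess_encoding` are assumptions chosen for illustration. It shows that the same characters become different byte sequences under different code pages, and that naive detection can be done by trial decoding.

```python
# Illustrative only: sample text and candidate order are assumptions.
text = "日本語"

# The same three characters yield different bytes in each code page.
encoded = {name: text.encode(name) for name in ("utf-8", "euc_jp", "shift_jis")}


def guess_encoding(data, candidates=("utf-8", "euc_jp", "shift_jis")):
    """Return the first candidate that decodes `data` without error, else None.

    Trial decoding is a crude heuristic: the order of candidates matters,
    and short or ASCII-only input is ambiguous.
    """
    for name in candidates:
        try:
            data.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    return None


for name, data in encoded.items():
    print(name, data.hex(), "->", guess_encoding(data))
```

Strict encodings such as UTF-8 are tried first because they reject most byte sequences that were produced by another code page; real detectors typically add statistical checks on top of this.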
documented on: 2007.05.23
http://mail.nl.linux.org/linux-utf8/2002-12/msg00010.html
To: linux-utf8@nl.linux.org Date: Mon, 2 Dec 2002 19:53:09 -0500 (EST)
> I created a few Japanese file and directory names in UTF-8 in Windows. Then
How could you make file and directory names in UTF-8 in Windows? Windows (both NTFS and VFAT) uses UTF-16 for filenames.
> I logged in from Linux (7.3) that is configured to run Japanese. From the login 'language' I can only select 'Japanese (eucJP)' (there is no Japanese (UTF-8)).
You can easily add 'Japanese (UTF-8)' to your gdm/kdm language selection menu. See <https://bugzilla.mozilla.org/bugzilla/show_bug.cgi?id=75829>. Or, you can just set it in ~/.i18n.
> I did a 'showmount -e 10.xxx.xxx.xxx' but I got scrambled Japanese characters for those entries that are encoded in UTF-8. Then I switched the locale to ja_JP.UTF-8, but the same stuff was returned. What's wrong with this picture?
How did you mount the Windows filesystem? With smbmount or NFS? If it's NTFS mounted via Samba, you have to specify 'iocharset=utf-8'. If it's VFAT exported over the net, you also have to specify the codepage (for Japanese, it's 932). For local filesystems, specifying the 'utf8' option (and 'codepage=932' for VFAT) to the mount command would be sufficient (see the man pages of mount(8) and fstab).
Needless to say, you have to run your shell in a UTF-8 terminal (e.g. xterm 16x or mlterm) to view UTF-8 characters.
Now in the case of NFS, I have no idea how a 'Windows NFS server' translates the UTF-16 used in NTFS and VFAT to multibyte encodings. There must be a server config option for that (the default might be the 'ANSI' codepage of the current locale; for Japanese, it's Windows-932/Shift_JIS). For a Unix NFS server with a Unix client, there's little need for encoding translation, although having one would be nice for some cases (e.g. EUC-JP on the server and UTF-8 on the client side).
Jungshik
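The kind of file name translation discussed in the message above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper `translate_name` and the sample name are hypothetical, and the Windows case assumes the server exports names in code page 932, which the message itself offers only as a guess.

```python
def translate_name(raw: bytes, server_enc: str, client_enc: str = "utf-8") -> bytes:
    """Re-encode a file name from the server's encoding to the client's."""
    return raw.decode(server_enc).encode(client_enc)


name = "日本語"  # hypothetical file name

# Unix server storing EUC-JP names, client running a UTF-8 locale.
from_unix = translate_name(name.encode("euc_jp"), "euc_jp")

# Assumed Windows case: the name is held as UTF-16 on disk and exported
# in the 'ANSI' code page 932 (an assumption, not confirmed behaviour).
on_disk = name.encode("utf-16-le")
exported = on_disk.decode("utf-16-le").encode("cp932")
from_windows = translate_name(exported, "cp932")

print(from_unix == name.encode("utf-8"))
print(from_windows == name.encode("utf-8"))
```

If the client decodes with the wrong `server_enc`, the call either raises `UnicodeDecodeError` or yields scrambled characters, which matches the symptom described in the quoted question.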