Sometimes I need to view files from the Window$ world. The problem is that the "infernal" world uses UTF-16, while most of my Linux tools can only handle UTF-8, especially zh-autoconvert, a tool that does smart conversion between Chinese encoding formats.
So, how to access those Unicode files from the Window$ world?
You can create Unicode files using WordPad in Win32 (when saving, select the "Unicode Text Document" format). In *nix, use YUdit to view them.
'recode' converts files between character sets and usages. AutoConvert is an intelligent Chinese Encoding converter. It uses builtin functions to judge the type of the input file's Chinese Encoding (such as GB/Big5/HZ), then converts the input file to any type of Chinese Encoding you want. You can use autoconvert to handle incoming mail, automatically converting messages to the Chinese Encoding you want.
cat test2.txt | recode -f UTF-16..UTF-8 | autogb -i utf8
Here is the test file I was looking for ways to decode, saved from WordPad using the "Unicode Text Document" format:
$ cat test2.txt | od -t x1
0000000 ff fe 22 6b ce 8f bf 8b ee 95 0d 00 0a 00 77 69
0000020 53 4f 20 00 c6 51 06 57 80 7b 53 4f 0d 00 0a 00
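The 'ff fe' at the start of the dump is the UTF-16 little-endian byte order mark, so a quick check of the first two bytes tells whether a file probably needs the UTF-16 treatment. A minimal sketch, assuming `head -c` is available; the helper name `has_utf16le_bom` and the sample file are made up for illustration:

```shell
# Hypothetical helper: succeeds if the file starts with the bytes ff fe.
has_utf16le_bom() {
    # Dump the first two bytes as hex and strip the whitespace od adds.
    bom=$(head -c 2 "$1" | od -An -t x1 | tr -d ' \n')
    [ "$bom" = "fffe" ]
}

# Usage sketch on a throwaway sample: BOM followed by U+6B22 in little-endian order.
sample=$(mktemp)
printf '\377\376\042\153' > "$sample"
if has_utf16le_bom "$sample"; then
    echo "looks like UTF-16LE"
else
    echo "no UTF-16LE BOM"
fi
```

This only looks at the byte order mark, so a UTF-16 file saved without one would not be recognized.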
Here is the same text, saved from 'test2.txt' above via yudit using the UTF-8 encoding. As we can see, UTF-16 uses less space than UTF-8 when encoding Chinese (32 bytes versus 35 here), even though the UTF-16 file has the 2 extra bytes 'ff fe' at the top and every line delimiter is 4 bytes long ('0d 00 0a 00'):
$ cat test3.txt | od -t x1
0000000 e6 ac a2 e8 bf 8e e8 ae bf e9 97 ae 0d 0a e6 a5
0000020 b7 e4 bd 93 20 e5 87 86 e5 9c 86 e7 ae 80 e4 bd
0000040 93 0d 0a
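The size comparison can be double-checked by rebuilding both files byte for byte from the two od dumps and counting. A small sketch; the temp files and variable names are only for illustration, and the octal escapes are just the hex bytes from the dumps rewritten so that any POSIX printf accepts them:

```shell
utf16=$(mktemp)
utf8=$(mktemp)

# test2.txt (UTF-16 with BOM), one printf per row of the od dump.
printf '\377\376\042\153\316\217\277\213\356\225\015\000\012\000\167\151' > "$utf16"
printf '\123\117\040\000\306\121\006\127\200\173\123\117\015\000\012\000' >> "$utf16"

# test3.txt (UTF-8), again one printf per row of the od dump.
printf '\346\254\242\350\277\216\350\256\277\351\227\256\015\012\346\245' > "$utf8"
printf '\267\344\275\223\040\345\207\206\345\234\206\347\256\200\344\275' >> "$utf8"
printf '\223\015\012' >> "$utf8"

# wc -c may pad its output with spaces on some systems, so strip them.
utf16_size=$(wc -c < "$utf16" | tr -d ' ')
utf8_size=$(wc -c < "$utf8" | tr -d ' ')
echo "UTF-16: $utf16_size bytes, UTF-8: $utf8_size bytes"
# prints: UTF-16: 32 bytes, UTF-8: 35 bytes
```

Each Chinese character takes 2 bytes in UTF-16 and 3 in UTF-8 here, which outweighs the BOM and the wider line endings even in such a short file.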
The file can be directly converted by autogb:
cat test3.txt | autogb -i utf8
To skip the extra step of manually saving a UTF-8 version with yudit, use 'recode':
cat test2.txt | recode -f UTF-16..UTF-8 | autogb -i utf8
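If recode is not at hand, the UTF-16 to UTF-8 step can probably be done with iconv as well. This is a swapped-in tool, not what I used above: a minimal sketch, assuming an iconv that accepts the `UTF-16` name and uses the BOM to pick the byte order (glibc's does). The temp file names are only for illustration, and the sample bytes are rebuilt from the od dump of test2.txt:

```shell
# Rebuild the UTF-16 sample from the od dump so the sketch is self-contained.
src=$(mktemp)
printf '\377\376\042\153\316\217\277\213\356\225\015\000\012\000\167\151' > "$src"
printf '\123\117\040\000\306\121\006\127\200\173\123\117\015\000\012\000' >> "$src"

# iconv stands in for "recode -f UTF-16..UTF-8" (assumption: iconv is installed).
converted=$(mktemp)
iconv -f UTF-16 -t UTF-8 "$src" > "$converted"

# Inspect the result the same way as the other dumps.
od -t x1 "$converted"
```

The converted output could then be piped into `autogb -i utf8` just like the recode output.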
FYI, you may wonder: since recode can handle almost any format, can we use it to convert directly from UTF-16 to a Chinese encoding? Unfortunately, the answer is no:
$ cat test2.txt | recode -f UTF-16..GB_2312-80
;6S-7CNJP-
No Chinese anymore. That is, when converting Chinese encodings, you can't live without 'AutoConvert'.
CJK-Howto on Unicode
http://linux-cjk.net/Unicode/
UTF-8 and Unicode FAQ
http://linux-cjk.net/Howto/cam.ac.uk/unicode.html
documented on: 2007-09-15