Unicode Converting

Decoding WordPad Unicode files

Situation

Sometimes I need to view files from the Window$ world. The problem is that the "infernal" world uses UTF-16, while most of my Linux tools can only handle UTF-8, especially zh-autoconvert, a tool that does smart conversion between Chinese encoding formats.

So, how can I access those Unicode files from the Window$ world?

Solution Synopsis

to view UTF-16 files, use 'yudit'
to convert UTF-16 files, use 'recode' & 'autogb'

Solution

yudit for viewing: yudit is a Unicode text editor for the X Window System. It does not need a localized environment or Unicode fonts. It supports simultaneous processing of many languages, conversions for local character standards and bidirectional input, and it has its own input methods. The package includes conversion utilities and also supports PostScript printing.


You can create a Unicode file using WordPad in Win32 (when saving, select the "Unicode Text Document" format). In *nix, use yudit to view it.

'recode' & 'autogb' for converting

'recode' converts files between character sets and usages. AutoConvert (the zh-autoconvert package, which provides the 'autogb' command) is an intelligent Chinese encoding converter. It uses built-in functions to detect the Chinese encoding of the input file (such as GB/Big5/HZ), then converts the input to whichever Chinese encoding you want. You can also use it to handle incoming mail, automatically converting messages to the Chinese encoding you want.

cat test2.txt | recode -f UTF-16..UTF-8 | autogb -i utf8
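
For a machine where recode is not installed, the UTF-16 to UTF-8 step of the pipeline above can be sketched in Python. This is only a minimal sketch under my own assumptions: the function name utf16_to_utf8 is made up for illustration, and it replaces the recode step only, not the encoding detection that autogb does.

```python
def utf16_to_utf8(src_path, dst_path):
    """Rewrite a UTF-16 text file (with BOM) as UTF-8 (hypothetical helper)."""
    # The "utf-16" codec reads the byte order mark to pick the byte order
    # and strips it from the decoded text. newline="" leaves the CR LF
    # line endings untouched.
    with open(src_path, "r", encoding="utf-16", newline="") as src:
        text = src.read()
    # Plain "utf-8" writes no byte order mark.
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)
```

Calling utf16_to_utf8("test2.txt", "test3.txt") should leave a UTF-8 file that can then be piped to 'autogb -i utf8' as before.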

Analysis / Working History

Here is the test file that I wanted to decode. I saved it from WordPad using the "Unicode Text Document" format:

$ cat test2.txt | od -t x1
0000000 ff fe 22 6b ce 8f bf 8b ee 95 0d 00 0a 00 77 69
0000020 53 4f 20 00 c6 51 06 57 80 7b 53 4f 0d 00 0a 00

Here is the same text, saved from the above 'test2.txt' via yudit using the UTF-8 encoding. As we can see, UTF-16 uses less space than UTF-8 when encoding Chinese (32 bytes versus 35 here), even though the UTF-16 file has the 2 extra bytes 'ff fe' at the top and every line delimiter is 4 bytes long ('0d 00 0a 00'):

$ cat test3.txt | od -t x1
0000000 e6 ac a2 e8 bf 8e e8 ae bf e9 97 ae 0d 0a e6 a5
0000020 b7 e4 bd 93 20 e5 87 86 e5 9c 86 e7 ae 80 e4 bd
0000040 93 0d 0a
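
The two dumps can be cross-checked with a few lines of Python. This is a sketch of my own rather than part of the original working history: the bytes are copied from the od output above, and the variable names are made up for illustration.

```python
# Bytes copied from the od dump of test2.txt (UTF-16, saved by WordPad).
UTF16_HEX = (
    "ff fe 22 6b ce 8f bf 8b ee 95 0d 00 0a 00 77 69 "
    "53 4f 20 00 c6 51 06 57 80 7b 53 4f 0d 00 0a 00"
)
# Bytes copied from the od dump of test3.txt (UTF-8, saved by yudit).
UTF8_HEX = (
    "e6 ac a2 e8 bf 8e e8 ae bf e9 97 ae 0d 0a e6 a5 "
    "b7 e4 bd 93 20 e5 87 86 e5 9c 86 e7 ae 80 e4 bd "
    "93 0d 0a"
)

utf16_bytes = bytes.fromhex(UTF16_HEX)
utf8_bytes = bytes.fromhex(UTF8_HEX)

# 'ff fe' is the byte order mark for little-endian UTF-16. The "utf-16"
# codec uses it to choose the byte order and drops it from the text.
text = utf16_bytes.decode("utf-16")

# Re-encoding the decoded text as UTF-8 should reproduce test3.txt exactly.
reencoded = text.encode("utf-8")

print(len(utf16_bytes), len(utf8_bytes), reencoded == utf8_bytes)
# prints: 32 35 True
```

So the two files hold the same text, and the UTF-16 version is 3 bytes shorter.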

This UTF-8 file can be converted directly by autogb:

cat test3.txt | autogb -i utf8

To skip the extra step of manually saving a UTF-8 version with yudit, use 'recode':

cat test2.txt | recode -f UTF-16..UTF-8 | autogb -i utf8

FYI, you may wonder: since recode can handle almost any format, can we use it to convert directly from UTF-16 to a Chinese encoding? Unfortunately, the answer is no:

$ cat test2.txt | recode -f UTF-16..GB_2312-80
;6S-7CNJP-

The Chinese is gone. In other words, when converting between Chinese encodings, you can't live without 'AutoConvert'.
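
One possible reading of that garbled output, and this is my guess rather than anything I have checked against the recode documentation, is that it is GB2312 in its 7-bit form, i.e. the usual 8-bit EUC-CN bytes with the high bit cleared. The Python sketch below takes the first four characters of test2.txt, encodes them with Python's "gb2312" codec (which produces EUC-CN), and then clears the high bit of every byte. The variable names are made up for illustration.

```python
# First line of test2.txt: U+6B22 U+8FCE U+8BBF U+95EE (four Chinese characters).
first_line = "\u6b22\u8fce\u8bbf\u95ee"

# Python's "gb2312" codec emits the 8-bit EUC-CN form: two bytes per
# character, each with the high bit set.
euc_cn = first_line.encode("gb2312")
print(" ".join(f"{b:02x}" for b in euc_cn))
# prints: bb b6 d3 ad b7 c3 ce ca

# Clearing the high bit of each byte gives printable ASCII.
seven_bit = bytes(b & 0x7F for b in euc_cn)
print(seven_bit.decode("ascii"))
# prints: ;6S-7CNJ
```

The result matches the first eight characters of the recode output above, which would explain why the terminal shows ASCII noise instead of Chinese.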

References

CJK-Howto on Unicode
http://linux-cjk.net/Unicode/

UTF-8 and Unicode FAQ
http://linux-cjk.net/Howto/cam.ac.uk/unicode.html

documented on: 2007-09-15