Decoding WordPad Unicode files 

Situation 

Sometimes I need to view files from the Window$ world. But the problem is that the "infernal" world uses UTF-16, while most of my Linux tools can only handle UTF-8, especially zh-autoconvert, a tool that can do smart conversion between Chinese encoding formats.

So, how do I access those Unicode files from the Window$ world?

Solution Synopsis 

cat test2.txt | recode -f UTF-16..UTF-8 | autogb -i utf8

Solution 

yudit for viewing

yudit is a Unicode text editor for the X Window System. It does not need a localized environment or Unicode fonts. It supports simultaneous processing of many languages, conversion to and from local character standards, and bidirectional input, and it has its own input methods. The package includes conversion utilities, and it also supports PostScript printing.

Tip: You can create a Unicode file using WordPad in Win32 (when saving, select the "Unicode Text Document" format). In *nix, use yudit to view it.
'recode' & 'autogb' for converting

'recode' converts files between character sets and usages. AutoConvert (zh-autoconvert) is an intelligent Chinese encoding converter. It uses built-in functions to judge the type of the input file's Chinese encoding (such as GB/Big5/HZ), then converts the input file to whichever Chinese encoding you want. You can also use autoconvert to handle incoming mail, automatically converting messages to the Chinese encoding you want.

cat test2.txt | recode -f UTF-16..UTF-8 | autogb -i utf8

Analysis / Working History 

Here is the test file, saved from WordPad in the "Unicode Text Document" format, that I was looking for a way to decode:

$ cat test2.txt | od -t x1
0000000 ff fe 22 6b ce 8f bf 8b ee 95 0d 00 0a 00 77 69
0000020 53 4f 20 00 c6 51 06 57 80 7b 53 4f 0d 00 0a 00

Here is the same text saved from the above 'test2.txt' via yudit as 'test3.txt', this time using UTF-8 encoding. As we can see, UTF-16 uses less space than UTF-8 when encoding Chinese (2 bytes per character instead of 3), even though the UTF-16 file has the 2 extra bytes 'ff fe' at the top and each of its line delimiters is 4 bytes long: '0d 00 0a 00':

$ cat test3.txt | od -t x1
0000000 e6 ac a2 e8 bf 8e e8 ae bf e9 97 ae 0d 0a e6 a5
0000020 b7 e4 bd 93 20 e5 87 86 e5 9c 86 e7 ae 80 e4 bd
0000040 93 0d 0a
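
As a cross-check between the two dumps: in the UTF-16 file, the bytes '22 6b' right after the 'ff fe' byte-order mark are the first Chinese character (U+6B22, stored little-endian), and the same character becomes the three bytes 'e6 ac a2' at the start of the UTF-8 file. If iconv is installed, the correspondence is easy to verify (a quick sanity check, not part of the original workflow; the '\x' escapes assume a printf that supports them, e.g. bash's built-in):

$ printf '\x22\x6b' | iconv -f UTF-16LE -t UTF-8 | od -An -t x1
 e6 ac a2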

This UTF-8 file can be converted directly by autogb:

cat test3.txt | autogb -i utf8

To save the extra step of manually saving a UTF-8 version with yudit, use 'recode':

cat test2.txt | recode -f UTF-16..UTF-8 | autogb -i utf8

FYI, you may wonder: since recode can handle almost any format, can we use it to convert directly from UTF-16 to a Chinese encoding? Unfortunately, the answer is no:

$ cat test2.txt | recode -f UTF-16..GB_2312-80
;6S-7CNJP-

No Chinese any more. In other words, when converting Chinese encodings, you can't live without 'AutoConvert'.

References 

CJK-Howto on Unicode http://linux-cjk.net/Unicode/

documented on: 2007-09-15

Introduction to Unicode - Using Unicode in Linux 

http://hektor.umcs.lublin.pl/~mikosmul/computing/articles/linux-unicode.html

Copyright (c) 2004 Michal Kosmulski.

This article is double-licensed under the Creative Commons Attribution-ShareAlike License and the GNU Free Documentation License.

Unicode, or the Universal Character Set (UCS), was developed to end once and for all the problems associated with the abundance of character sets used for writing text in different languages. It is a single character set whose goal is to be a superset of all others used before, and to contain every character used in writing any language (including many dead languages) as well as other symbols used in mathematics and engineering. Any charset can be losslessly converted to Unicode, as we'll see.

ASCII, a character set based on 7-bit integers, used to be and still is popular. While its provision for 128 characters was sufficient at the time of its birth in the 1960s, the growing popularity of personal computing all over the world made ASCII inadequate for people speaking and writing many different languages with different alphabets.

Newer 8-bit character sets, such as the ISO-8859 family, could represent 256 characters (actually fewer, as not all could be used for printable characters). This solution was good enough for many practical uses, but while each character set contained characters necessary for writing several languages, there was no way to put in a single document characters from two different languages that used characters found in two distinct character sets. In the case of plain text files, another problem was how to make software automatically recognize the encoding; in most cases human intervention was required to tell which character set was used for each file. A totally new class of problems was associated with using Asian languages in computing; non-Latin alphabets posed new challenges due to some languages' need for more than 256 characters, right-to-left text, and other features not taken into account by existing standards.

Unicode aims to resolve all of those issues.

Two organizations maintain the Unicode standard: the Unicode Consortium and the International Organization for Standardization (ISO). The names Unicode and ISO/IEC 10646 are equivalent when referring to the character set (however, the Unicode Consortium's definition of Unicode provides more than just the character set standard; it also includes a standard for writing bidirectional text and other related issues).

Unicode encodings 

Unicode defines a (rather large) number of characters and assigns each of them a unique number, the Unicode code, by which it can be referenced. How these codes are stored on disk or in a computer's memory is a matter of encoding. The most common Unicode encodings are called UTF-n, where UTF stands for Unicode Transformation Format and n is a number specifying the number of bits in a basic unit used by the encoding.

Note that Unicode changes one assumption which had been correct for many years before, namely that one byte always represents one character. As you'll see, a single Unicode character is often represented by more than one byte of data, since the number of Unicode characters exceeds 256, the number of different values which can be encoded in a single byte. Thus, a distinction must be made between the number of characters and the number of bytes in a piece of text.

Two very common encodings are UTF-16 and UTF-8. In UTF-16, which is used by modern Microsoft Windows systems, each character is represented as one or two 16-bit (two-byte) words. Unix-like operating systems, including Linux, use another encoding scheme, called UTF-8, where each Unicode character is represented as one or more bytes (up to four; an older version of the standard allowed up to six).
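
To make the size difference concrete, here is a small illustration (assuming a UTF-8 terminal and GNU iconv; the exact od spacing may vary). An ASCII letter takes one byte in UTF-8 but a full 16-bit unit in UTF-16, while an accented letter such as 'é' (U+00E9) takes two bytes in UTF-8 and still one 16-bit unit in UTF-16:

$ printf 'A' | od -An -t x1                               # UTF-8: one byte, same as ASCII
 41
$ printf 'A' | iconv -f UTF-8 -t UTF-16LE | od -An -t x1  # UTF-16: one 16-bit unit
 41 00
$ printf 'é' | od -An -t x1                               # UTF-8: two bytes
 c3 a9
$ printf 'é' | iconv -f UTF-8 -t UTF-16LE | od -An -t x1  # UTF-16: still one 16-bit unit
 e9 00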

UTF-8 has several interesting properties which make it suitable for this task. First, ASCII characters are encoded in exactly the same way in ASCII and in UTF-8. This means that any ASCII text file is also a correct UTF-8 encoded Unicode text file, representing the same text. In addition, when encoding characters that take up more than one byte in UTF-8, characters from the ASCII character set are never used. This ensures, among other things, that if a piece of software interprets such a file as plain ASCII, non-ASCII characters are ignored or in the worst case treated as random junk, but they can't be read in as ASCII characters (which could accidentally form some correct but possibly malicious configuration option in a config file or lead to other unpredictable results). Given the importance of text files in Unix, these properties are significant. Thanks to the way UTF-8 was designed, old configuration files, shell scripts, and even lots of age-old software can function properly with Unicode text, even though Unicode was invented years after they came to be.
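
The "no ASCII bytes inside multibyte sequences" property is easy to see in a byte dump (again assuming a UTF-8 terminal): every byte of a multibyte UTF-8 character has its high bit set, i.e. is 0x80 or greater, so none of them can ever be mistaken for '/' (0x2f), NUL, a quote, or any other ASCII character:

$ printf 'ä' | od -An -t x1
 c3 a4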

How Linux handles Unicode 

When we say that a Linux system "can handle Unicode," we usually mean that it meets several conditions:

Thanks to the properties of UTF-8 encoding, the Linux kernel, the innermost and lowest-level part of the operating system, can handle Unicode filenames without even having the user tell it that UTF-8 is to be used. All character strings, including filenames, are treated by the kernel in such a way that they appear to it only as strings of bytes. Thus, it doesn't care and does not need to know whether a pair of consecutive bytes should logically be treated as two characters or a single one. The only risk of the kernel being fooled would be, for example, for a filename to contain a multibyte Unicode character encoded in such a way that one of the bytes used to represent it was a slash or some other character that has a special meaning in file names. Fortunately, as we noted, UTF-8 never uses ASCII characters for encoding multibyte characters, so neither the slash nor any other special character can appear as part of one and therefore there is no risk associated with using Unicode in filenames.

Filesystem types not originally intended for use with Unices, such as those used by Windows, are slightly different, as we'll see later on.

User space programs use so-called locale information to correctly convert bytes to characters, and for other tasks such as determining the language for application messages and date and time formats. It is defined by values of special environmental variables. Correctly written applications should be capable of using UTF-8 strings in place of ASCII strings right away, if the locale indicates so.

Most end-user applications can handle Unicode characters, including applications written for the GNOME and KDE desktop environments, OpenOffice.org, the Mozilla family of products, and others. However, Unicode is more than just a character set: it introduces rules for character composition, bidirectional writing, and other advanced features that are not always supported by common software.

Some command-line utilities have problems with multibyte characters. For example, tr always assumes that one character is represented as one byte, regardless of the locale. Also, common shells such as Bash (and other utilities using the readline library, it seems) tend to get confused if multibyte characters are inserted at the command line and then removed using the Backspace or Delete key.
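
The tr limitation is easy to reproduce (a minimal demonstration, assuming GNU tr and a UTF-8 terminal): asked to translate the two-byte character 'ä' into 'a', tr translates each of the two bytes separately and produces "aa":

$ printf 'ä\n' | tr 'ä' 'a'
aa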

Converting a system to Unicode 

First of all, check whether you're already using a Unicode locale. The command locale prints out the values of environmental variables that influence the locale settings. A complete description of their meanings is available in locale man pages. Usually, locale names consist of a lowercase language code followed by an underscore and an uppercase country code (e.g. en_US for U.S. English). Unicode locale names that use UTF-8 encoding additionally end with ".UTF-8." If such names are present in the output of locale, you are already using a Unicode locale.
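
For reference, on a system that already uses a UTF-8 locale, the output of locale looks roughly like this (illustrative only; the language/country part and the exact list of LC_* variables will differ on your system):

$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=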

If you do need to make the conversion, back up all your important data first, as you'll be converting your disk's filesystems. Note that backups made before the filesystem conversion and backups made afterwards are somewhat incompatible with each other. As we noted last time, the operating system and many utilities do not realize what characters the bytes in file names represent. Among the utilities with this problem is the tar program, which is a popular backup tool. If you are using an en_US locale now and some filename contains the character "ä" (German a umlaut), it is represented as a single byte: hex 0xE4. After you move to UTF-8, it will be represented as two bytes: 0xC3 0xA4. However, neither the filesystem nor tar know that these two different byte sequences can represent the same character. If you restore your old file from backup after moving to a UTF-8 locale, the old one-byte sequence will be used in the restored file's name, making it different from the new version's name. Under a UTF-8 locale, this single byte won't be considered "ä" but rather an invalid UTF-8 sequence, and will be displayed as a placeholder or as an octal representation of the erroneous byte only. So if you restore data from older backups or archives after you move to UTF-8, you may need to run a filename conversion similar to the one we'll describe below on the extracted files in order to get the filenames correct.

To be able to use UTF-8 locales without having to invest too much work in it, you will need glibc (GNU C library) version 2.2 or newer (any reasonably modern distro should have it). You can check your version by running /lib/libc.so.6.
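
For example, either of the following commands prints the library's version banner (the path shown is the traditional one; on some systems the library lives in a different directory, and since ldd ships with glibc, its version matches):

$ /lib/libc.so.6 | head -n 1
$ ldd --version | head -n 1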

Instructions listed in this chapter can lead you through the process of converting a filesystem to use UTF-8 encoding. They also try to give you an idea of what exactly is going on and to help you understand each step. However, performing the conversion this way can be time-consuming. There is a tool called convmv which performs file system conversions in a way which is probably more user-friendly and faster than the method described here, although the educational aspect is not highlighted as much. Convmv is also capable of performing conversions between different Unicode normalization forms (NFC and NFD) which this script is not (note: I have never used convmv so I can't make any statements about its suitability for particular tasks; I'm just repeating here what the program's homepage says). So, if you just want to convert your filesystem as quickly and painlessly as possible, convmv may be a better choice and you can skip over to the next part of this article. However, if you want to gain a better understanding of how things work, read on.

The following paragraphs give a step-by-step description of how to perform the conversion. Most operations described below must be performed as root.

Setting the locale 

Certain environment variables tell applications which locale is to be used. Commonly used variables are LANG, LC_ALL, and the more specific LC_* variables such as LC_CTYPE (character classification and encoding), LC_MESSAGES (language of application messages), and LC_TIME (date and time formats); LC_ALL, when set, overrides all of the others.

Before modifying your locale, remember or save to a file the output of locale, which shows your current locale. Also, note down the output of locale -k LC_CTYPE | fgrep charmap (your current character encoding), as you will need this information later on.
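
A convenient way to keep this information around is to save it to a file before changing anything (the charmap value shown is only an example; yours may differ, and you will need it for the file name conversion step later on):

$ locale > ~/locale.before
$ locale -k LC_CTYPE | fgrep charmap
charmap="ISO-8859-1"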

In order to tell applications to use UTF-8 encoding, and assuming U.S. English is your preferred language, you could use the following command:

export LC_ALL=en_US.UTF-8

Applications started afterwards from the same terminal window should be aware of UTF-8. To check if that's the case, you could for example use the command wc. wc -c will tell you the number of bytes and wc -m the number of characters in a file or in data read from standard input (end typing with Enter and Ctrl-D). In a UTF-8 locale, if text contains non-ASCII characters, the number of bytes will be greater than the number of characters. For example:

user@host:~$ wc -c
5
user@host:~$ wc -m
4

The word typed here (which does not appear in the transcript above) has three characters, one of them non-ASCII, so it is encoded using four bytes in UTF-8; the extra character and byte in the counts come from the end-of-line marker.

If the test failed (i.e. wc returns the same number in both cases), your system probably came without UTF-8 locale definitions, and you will have to use localedef to generate them. For example, if en_US.UTF-8 is missing, you can generate it from en_US using:

localedef -i en_US -f UTF-8 en_US.UTF-8

Since values of environment variables last only as long as your session, you have to put your export commands in /etc/profile so that they are run for each user the next time he or she logs in. If you perform your work from inside KDE, you will have to log out and back in so that environmental variables can be re-read in order for changes to take effect. GNOME seems to always use UTF-8 internally, even if the locale is not UTF-8-based. No matter which desktop environment you are using, it may be necessary to log out and, if you are using a login manager (e.g. KDM or GDM), restart the X Window System by pressing Ctrl-Alt-Backspace so that /etc/profile is re-read and all applications come to know about the new locale.
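
A minimal /etc/profile fragment for this could look like the following (assuming U.S. English, as in the export command above; substitute your own language and country code, and setting LANG as well is optional):

# Use a UTF-8 locale by default
export LANG=en_US.UTF-8
export LC_ALL=en_US.UTF-8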

Converting filesystems 

The next step is to convert your filesystems. This is the only risky part of the transition, so do make a backup of all important data from your disks if you haven't done so yet.

As noted above, the Linux kernel doesn't care about character encodings. For common Linux filesystems (ext2, ext3, ReiserFS, and other filesystems typical for Unices), information that a particular filesystem uses one encoding or another is not stored as a part of that filesystem. Only locale-controlling environment variables tell software that particular bytes should be displayed as one or another character. Filesystems found on Microsoft Windows machines (NTFS and FAT) are different in that they store filenames on disk in some particular encoding. The kernel must translate this encoding to the system encoding, which will be UTF-8 in our case.

If you have Windows partitions on your system, you will have to take care that they are mounted with correct options. For FAT and ISO9660 (used by CD-ROMs) partitions, option utf8 makes the system translate the filesystem's character encoding to UTF-8. For NTFS, nls=utf8 is the recommended option (utf8 should also work). Add these mount options to filesystems of these types in your /etc/fstab to make them mount with the correct options. A fragment of /etc/fstab might then look like this (other options may vary depending upon your setup):

/dev/hda2        /mnt/c           ntfs        defaults,ro,nls=utf8                        1 0
/dev/hda3        /mnt/d           vfat        defaults,quiet,utf8                         1 0
/dev/cdrom       /mnt/cdrom       iso9660     defaults,noauto,users,ro,utf8               0 0
# If using supermount, add "utf8" to the options _after_ two dashes, e.g.
#none             /mnt/cdrom       supermount  fs=iso9660,dev=/dev/cdrom,--,auto,ro,utf8   0 0
/dev/fd0         /mnt/floppy      auto        defaults,noauto,users,rw,quiet,utf8         0 0

After you modify /etc/fstab, you should remount the filesystems in question by issuing a mount -o remount /mnt/mount-point command for each of them. Non-ASCII characters in filenames on those filesystems should now be displayed correctly again. Note that this requires the kernel to be capable of converting between character sets, so support for UTF-8 must be compiled in or available as a module. This option is available in "File systems"->"Native Language Support"->"NLS UTF-8" in the kernel configuration program. Depending upon which encoding your Windows partitions use, you may also need to compile in support for that encoding. Check this page for a list of codepages used by various language versions of FAT. NTFS always uses Unicode internally and doesn't need any kernel NLS options except for UTF-8 support.
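
Continuing the /etc/fstab fragment above, the remount step would look like this (repeat for every currently mounted filesystem whose options you changed; removable media mounted with noauto, such as the CD-ROM, can simply be unmounted and mounted again):

$ mount -o remount /mnt/c
$ mount -o remount /mnt/d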

Native Linux filesystems do not store information about the character encoding used, so you must physically change the names of all files to the new encoding, as opposed to the simple remounting of FAT and NTFS volumes. In theory, all you need to do is execute a command like:

mv original-filename filename-in-UTF-8-encoding

for each file. In practice, things tend to be a little more complicated. First of all, you may already have UTF-8-encoded filenames on your disk without knowing it. For example, some GNOME applications tend to create UTF-8 filenames regardless of the locale used, and kbd (a set of utilities for handling console fonts) comes with a sample file called ♪♬ (two musical notes) in the documentation. During conversion, these files need to be identified and their names left unchanged.

Another issue to look out for is directories. Since both the directory name and names of files contained within may need changing to their UTF-8 equivalents, you can't simply create a list of all files and directories and then perform mv old-name new-name for each of them. If you did and first renamed a directory, then the path referring to the files in that directory would no longer be valid. So, order is important. In the directory tree, the leaves (i.e. files) should be renamed first, then the lowest-level directories, then their parent directories, and so on.

Below is a script that tries to perform the necessary conversion automatically. Note that using it may be dangerous; back up important data first! While it should work in most cases, this script isn't bulletproof. In order to keep things simple, it doesn't handle some special cases such as spaces in mount paths (which are really rare) and read-only filesystems (it is not obvious what should be done with those; if you do intend to convert a read-only mounted hard disk partition, remount it manually as read-write with mount -o remount,rw /some/mount/path before running the script). Depending on filesystem size and the number of files that need conversion, this script may take a long time to complete, especially since for simplicity's sake it is far from optimal (all this could probably be done in Perl in a much more compact way).

Remember to modify orgcharset in the script below to the name of your old character set, found out in one of the previous steps using locale -k LC_CTYPE.

#!/bin/sh

fstab=/etc/fstab
orgcharset=INVALID_CHARSET_NAME

export LC_ALL=POSIX

# Find filesystems suitable for conversion
filesystems=`awk '!/vfat|ntfs|iso9660|udf|auto|autofs|swap|subfs|sysfs|proc|devpts|nfs|smbfs|^#/{print $2}' "$fstab"`
# Locate files whose names need to be converted and sort the list
find $filesystems -xdev | {
        # Read raw file names; -r and an empty IFS keep backslashes and surrounding spaces intact
        while IFS= read -r REPLY; do
                # Check if the filename needs conversion (i.e. is not a correct UTF-8 string)
                if ! echo `basename "$REPLY"` | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
                        echo "$REPLY"
                fi
        done
} | sort -r | {
        # Rename files
        while IFS= read -r REPLY; do
                dirname=`dirname "$REPLY"`
                orgfname=`basename "$REPLY"`
                newfname=`echo "$orgfname" | iconv -f "$orgcharset" -t UTF-8`
                if [ $? -ne 0 ]; then
                        echo "Error: iconv failed for $REPLY. Skipping." >&2
                        continue
                fi
                mv "$REPLY" "$dirname"/"$newfname"
        done
}

Converting text files 

It is convenient for user text files to use the system's default encoding, so after moving to UTF-8 you may want to convert your text files too. Converting configuration files isn't necessary, as programs that can handle non-ASCII data in their config almost always use UTF-8 for storing it already. You can convert a single text file with iconv:

iconv -f old-encoding -t UTF-8 filename > temp.tmp && cat temp.tmp > filename && rm temp.tmp

Again, make sure this actually works before playing with important data.
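
If you have many text files to convert, a small loop saves typing. This is only a sketch: it assumes the old encoding was ISO-8859-1, it overwrites each file in place on success, and it should be tried on copies first:

for f in *.txt; do
        iconv -f ISO-8859-1 -t UTF-8 "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done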

Getting fonts with Unicode support 

Unicode fonts for the text console are usually shipped with major Linux distributions. To enable UTF-8 on the console, run unicode_start; run unicode_stop to return to the previous one-byte encoding mode.

In order to be able to actually see Unicode characters displayed by X applications, you need to download and install Unicode fonts. Bitstream Vera is a TrueType font available under an open source license that now comes with many Linux distributions. Unfortunately, it contains few characters. An extended version, with support for most Latin accented characters, is called Hunky Font. A family of Unicode fonts called FreeFont is available under the GPL. There are also a number of free-as-in-beer fonts on the Web, including Microsoft Core Fonts (a package containing among others the popular typefaces Arial and Times New Roman), Bitstream Cyberbit (only Roman style is available, but it has a very good Unicode coverage), Gentium, and many others. Of course, there are also lots of commercial fonts that can be used with X.

Summary 

Using UTF-8 has many advantages over using a single-byte locale. A minor one is the ability to use any character in file names and on the command line. The main advantage of Unicode, however, is that it allows easier data exchange and better interoperability than any other character set. UTF-8 is meant to replace ASCII in the future, so at some point "text file" is going to mean "UTF-8 file" just as it means "ASCII file" now.

History 

documented on: 2007-09-15

Convert file names to a different encoding with convmv 

http://www.linux.com/feature/58689

By Manolis Tzanidakis on December 11, 2006

Recent versions of most Linux distributions support non-English languages out of the box by using the Unicode standard. I was pleasantly surprised when I found out that I was able to read and write in Greek — my native language — on a fresh Ubuntu Edgy Eft installation without any manual intervention.

Unfortunately, my happiness lasted only until I tried to open files with Greek file names. Instead of Greek characters I saw garbage. I've been using the 8-bit ISO 8859-7 encoding for Greek file names, and since it worked well I was too lazy to convert my systems to Unicode. Manually renaming hundreds of files in order to convert them to Unicode was not an option; I needed some kind of automation. Convmv is the right tool for that job.

Convmv is a Perl program that converts file names and directories between different character encodings. It converts only the file names, not the content of the files, and can also convert a whole filesystem, including symlinks. Most Linux distributions offer packages for convmv, and you can also find it in the FreeBSD ports and NetBSD pkgsrc. Manual installation is fairly easy since the program depends only on Perl, which is installed by default on virtually all Linux distributions and BSD variants; running make install will install the program in /usr/local/bin and its man page in /usr/local/share/man/man1.

Running convmv without any arguments prints the list of all available options. All options are explained in detail in the program's man page.

Let's start by running convmv --list to display the supported encodings. To convert all Ogg/Vorbis files in the current directory from ISO 8859-7 to UTF-8 (Unicode) run convmv -f iso-8859-7 -t utf8 *.ogg. This command will not actually rename the files; it just prints what it would do. To rename the files, add the --notest option. If you want the program to ask for confirmation before any action, add the -i option to enable interactive mode.
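
Put together, a typical session looks like this (the encoding and file names are just the examples from above):

$ convmv --list                                   # show the supported encodings
$ convmv -f iso-8859-7 -t utf8 *.ogg              # dry run: only report what would be renamed
$ convmv -f iso-8859-7 -t utf8 --notest *.ogg     # actually rename the files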

By default the program checks whether the file names you want to rename are already using the specified encoding and skips them accordingly. Although you can speed up the whole process by disabling this feature with the --nosmart switch, it's better not to, since it could lead to "double-encoded" file names with incorrect characters. Nevertheless, the man page has a section on how to repair double-encoded files. The program will also stop if you try to rename a file by giving it a name with the target encoding that already exists on the same path. You can, however, use the --replace switch to have that file overwritten in case its content is the same as that of the original file.

After making sure that your options work correctly, it's time to convert the whole filesystem to UTF-8 with a single command. We will also add the -r switch, which enables recursive mode. For example, issue convmv -f iso-8859-7 -t utf8 -r --notest --replace ~/data to convert all the files and directories inside the data directory in your home from ISO 8859-7 to UTF-8. You can also use convmv to convert file names to all upper or lower case with the --upper and --lower options respectively. If the file is not ASCII-encoded you must also supply its encoding with the -f switch.

Besides the conversion to Unicode, convmv can be useful when you need to exchange files with users of obsolete operating systems that have no support for the Unicode charset, such as Windows 98 or older versions of Linux. Speaking of cross-platform interoperability, Mac OS X has a strange way of handling Unicode-encoded file names. Linux and most other Unix-like OSes use the C normalization form (NFC) for encoding to UTF-8, while OS X uses NFD. Convmv can convert file names between these two standards with the --nfc and --nfd switches. You might face similar issues with the JFS and NFS v4 file systems; check the convmv man page for more information.
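
A normalization-form conversion might look roughly like this (a sketch only; the exact combination of switches may differ between convmv versions, so check the man page before running it):

$ convmv -r -f utf8 -t utf8 --nfc --notest ~/data   # e.g. convert NFD names copied from OS X to NFC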

Convmv made my transition to Unicode as painless as possible. It converted all my files while I was making a cup of coffee, giving me plenty of time to play with the new version of Ubuntu.

documented on: 2007-09-15