Non-English Encoding


Table of Contents

cmd:cpdetector 
Basic Info 
Why doesn't Linux display Japanese file names encoded in UTF-8 

cmd:cpdetector 

Basic Info 

Usage 

Info 

cpdetector - code page detector

Description 

cpdetector is a framework for configurable code page detection of documents. It can be used to detect the code page of documents retrieved from remote hosts.

Code page detection is needed whenever a document's encoding is not known. It is therefore a core requirement for any application in the field of information mining or information retrieval.
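To make the idea of detection concrete, here is a minimal sketch in Python that guesses an encoding by trial decoding. It is not cpdetector's algorithm or API. The function name guess_code_page and the candidate list are hypothetical, chosen only for illustration.

```python
def guess_code_page(raw, candidates=("ascii", "utf-8", "shift_jis", "euc_jp")):
    """Return the first candidate encoding that decodes raw without error.

    Hypothetical helper: a naive trial-decoding heuristic, not how
    cpdetector itself works. Returns None if no candidate fits.
    """
    for name in candidates:
        try:
            raw.decode(name)          # strict decoding raises on invalid bytes
        except UnicodeDecodeError:
            continue                  # this code page cannot represent the bytes
        return name
    return None


if __name__ == "__main__":
    print(guess_code_page(b"plain text"))
    print(guess_code_page("日本語".encode("utf-8")))
    print(guess_code_page("日本語".encode("shift_jis")))
```

A heuristic like this is easily fooled, because many byte sequences are valid in more than one code page and the order of the candidates then decides the answer. That ambiguity is one reason a configurable detection framework is useful.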

Even today, although Unicode unifies the different character encodings by assigning a unique number (code point) to every character, documents on the Internet are still encoded in many different code pages. Asian documents in particular draw on a very large number of characters and are therefore often encoded in language-specific code pages. In order to process a text document, its bits have to be mapped (decoded) to characters by the correct character encoding table (code page).
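The effect of choosing the right or wrong code page can be illustrated with a short Python sketch using only the standard codecs. The sample string and the choice of Shift_JIS, Latin-1 and UTF-8 are assumptions made for the example, not taken from cpdetector.

```python
# The same bytes decoded with different code pages (illustrative sample only).
text = "日本語"                        # three Japanese characters
raw = text.encode("shift_jis")        # bytes as a Shift_JIS document would store them

decoded_ok = raw.decode("shift_jis")  # correct code page: the original text comes back
decoded_bad = raw.decode("latin-1")   # wrong code page: no error, but garbled characters

try:
    raw.decode("utf-8")               # wrong code page that rejects these bytes outright
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

print(decoded_ok == text)             # correct decoding round-trips
print(len(raw), len(decoded_bad))     # Latin-1 maps each byte to one character
print(utf8_failed)
```

A wrong code page does not always fail loudly. Latin-1 accepts any byte sequence and silently produces the wrong characters, so the encoding has to be known or detected before decoding.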

documented on: 2007.05.23