Non-English Encoding

Table of Contents

cmd:cpdetector

Basic Info

Usage

Info

cpdetector - code page detector

Source

http://cpdetector.sourceforge.net/

Description

cpdetector is a framework for configurable code page detection of documents. It can be used, for example, to detect the code page of documents retrieved from remote hosts.

Code page detection is needed whenever the encoding of a document is not known. It is therefore a core requirement for any application in information mining or information retrieval.

Even today, when Unicode unifies the different character encodings by assigning a unique number (code point) to every character, documents on the Internet are still encoded in many different code pages. Asian documents in particular draw on a very large number of characters and are therefore often encoded in language-specific code pages. To process a text document, its bits have to be mapped (decoded) to characters using the correct character encoding table (code page).
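To make the decoding step concrete, here is a minimal sketch that uses only the JDK standard library. It is not cpdetector's API or its detection algorithm. The class name `CharsetGuess`, the candidate list, and the "first charset that decodes without errors" rule are assumptions made purely for illustration. The sketch shows that the same bytes give different text under different code pages, and that strict decoding can rule out some candidates.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Optional;

class CharsetGuess {
    // Hypothetical candidate list, ordered from strictest to most permissive.
    // ISO-8859-1 accepts every byte sequence, so it acts as a fallback.
    static final List<Charset> CANDIDATES = List.of(
            StandardCharsets.US_ASCII,
            StandardCharsets.UTF_8,
            StandardCharsets.ISO_8859_1);

    // True if the bytes decode under the charset without malformed
    // or unmappable input being reported.
    static boolean decodesCleanly(byte[] data, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Naive guess: the first candidate that decodes the bytes cleanly.
    static Optional<Charset> guess(byte[] data) {
        return CANDIDATES.stream()
                         .filter(c -> decodesCleanly(data, c))
                         .findFirst();
    }

    public static void main(String[] args) {
        // "Gruesse" with u-umlaut and sharp s, written with escapes so the
        // source file's own encoding does not matter.
        String text = "Gr\u00fc\u00dfe";
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);

        // Decoding with the matching code page restores the text.
        System.out.println(new String(latin1, StandardCharsets.ISO_8859_1));
        // Decoding with the wrong code page yields replacement characters.
        System.out.println(new String(latin1, StandardCharsets.UTF_8));

        System.out.println(guess(latin1));
        System.out.println(guess(text.getBytes(StandardCharsets.UTF_8)));
    }
}
```

A clean decode only proves that a code page is possible, not that it is the right one. Many single-byte code pages accept any input, which is why real detectors combine several strategies rather than relying on a check like this alone.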

documented on: 2007.05.23