RecAPI
CCJK code page description file

These files contain three sections, in this order: the property section, the range section and the code section.

In the property section you can specify properties just like in an INI file. There are three valid properties:

  • Name: name of the code page (maximum 16 characters)
  • Description: description of the code page (maximum 16 characters)
  • MSID: Microsoft code page identifier (a decimal number)
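The limits above can be checked with a few lines of code. This is a minimal sketch, not part of RecAPI: the Key=Value line syntax and the case-sensitive property names are assumptions drawn from the INI comparison, and parse_properties is a hypothetical helper.

```python
# Hypothetical validator for the property section.
# Assumption: each property is written as "Key=Value", as in an INI file.
VALID_KEYS = ("Name", "Description", "MSID")
MAX_TEXT_LENGTH = 16  # limit stated for Name and Description


def parse_properties(lines):
    """Return a dict of the properties found in the given lines."""
    props = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        key, sep, value = line.partition("=")
        key, value = key.strip(), value.strip()
        if not sep or key not in VALID_KEYS:
            raise ValueError(f"not a valid property line: {line!r}")
        if key == "MSID":
            # MSID is a decimal number written with ASCII digits.
            if not (value.isascii() and value.isdigit()):
                raise ValueError("MSID must be a decimal number")
            props[key] = int(value)
        else:
            if len(value) > MAX_TEXT_LENGTH:
                raise ValueError(f"{key} is longer than {MAX_TEXT_LENGTH} characters")
            props[key] = value
    return props
```

For example, parse_properties(["Name=MyPage", "MSID=932"]) returns a dictionary with the name as a string and the identifier as an integer. The values MyPage and 932 are illustrative only.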

The other two sections contain only hexadecimal numbers enclosed between the less-than sign (U+003C) and the greater-than sign (U+003E). The digits must be ASCII characters, and a number can have at most 256 digits. You can use the space (U+0020) and the horizontal tab (U+0009) to format the numbers.

These numbers describe character codes and Unicode strings. A character code must have an even number of hexadecimal digits, and a Unicode string must have a number of digits divisible by four. All numbers are big-endian: the most significant byte comes first and the least significant byte comes last. Unicode strings are in UTF-16BE format, so you can use surrogate pairs.
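The digit rules can be illustrated with a short sketch. It is not the RecAPI parser: parse_number and decode_unicode are hypothetical helpers, and two points are assumptions, namely that spaces and tabs are tolerated inside the angle brackets and that lowercase hexadecimal digits are accepted.

```python
MAX_DIGITS = 256
HEX_DIGITS = set("0123456789ABCDEFabcdef")  # ASCII digits only


def parse_number(token, kind):
    """Turn a token such as '<81A1>' into big-endian bytes.

    kind is 'code' (even number of digits) or 'unicode' (multiple of four).
    """
    token = token.strip(" \t")
    if not (token.startswith("<") and token.endswith(">")):
        raise ValueError("number must be enclosed in < and >")
    # Assumption: formatting whitespace may also appear between the digits.
    digits = token[1:-1].replace(" ", "").replace("\t", "")
    if not digits or any(ch not in HEX_DIGITS for ch in digits):
        raise ValueError("only ASCII hexadecimal digits are allowed")
    if len(digits) > MAX_DIGITS:
        raise ValueError("a number can have at most 256 digits")
    multiple = 2 if kind == "code" else 4
    if len(digits) % multiple:
        raise ValueError(f"digit count must be a multiple of {multiple}")
    return bytes.fromhex(digits)


def decode_unicode(token):
    """Decode a Unicode string token as UTF-16BE text."""
    return parse_number(token, "unicode").decode("utf-16-be")
```

With these helpers, decode_unicode("<DB40DC00>") yields the single code point U+E0000, because the two 16-bit units form a surrogate pair.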

The range section begins with the line Ranges. This section describes the valid ranges of character codes. The encoding routines use these ranges to calculate the lengths of character codes. Every line in this section has two character code numbers, which must have the same number of digits. The first number is the lowest value in the range, and the second is the highest. An example: <81A1> <FEFE>. This means that every code in this range is two bytes long, its first byte is at least 0x81 and at most 0xFE, and its second byte is at least 0xA1 and at most 0xFE.
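The byte-by-byte reading of a range can be sketched as follows. The membership test follows the description above. The code_length function is only a guess at how a routine could derive a code length from the ranges, and the single-byte range <00> <7F> in the usage line is a hypothetical addition for illustration.

```python
def in_range(code, low, high):
    """True if every byte of code lies between the matching bytes of low and high."""
    return len(code) == len(low) == len(high) and all(
        lo <= b <= hi for b, lo, hi in zip(code, low, high)
    )


def code_length(data, ranges):
    """Length of the character code at the start of data, or None if no range fits.

    Assumption: the first range whose bounds accept the leading bytes wins.
    """
    for low, high in ranges:
        candidate = data[: len(low)]
        if in_range(candidate, low, high):
            return len(low)
    return None


# <00> <7F> is hypothetical; <81A1> <FEFE> is the example from the text.
RANGES = [(b"\x00", b"\x7f"), (b"\x81\xa1", b"\xfe\xfe")]
```

Here in_range(b"\x81\xa0", b"\x81\xa1", b"\xfe\xfe") is false, because the second byte 0xA0 is below the lower bound 0xA1, even though the two-byte value as a whole lies between 0x81A1 and 0xFEFE.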

The code section begins with the line Codes. This section describes the assignments between character codes and Unicode strings. You can assign multiple Unicode strings to one character code, and multiple character codes to one Unicode string. You can even assign a Unicode string that contains another assigned Unicode string. The encoding routines are able to handle these cases.

There are three kinds of assignment: the single assignment, the simple code range assignment and the advanced code range assignment.

The single assignment contains two numbers: a character code and a Unicode string. It assigns the code to the Unicode string. The code must be in a range defined in the Ranges section. An example: <0505> <E800>. This assigns the character code 0505 to U+E800. An example with surrogates: <0506> <DB40DC00>. This assigns the character code 0506 to U+E0000.

The simple code range assignment has this format:

<start code> <end code> <first unicode>

This assigns consecutive character codes to consecutive Unicode values. The first code is assigned to the first Unicode value, the next code to the next higher Unicode value, and so on. The start code and the end code must be in the same range. They may differ only in their last byte, and the last byte of the end code must be greater than or equal to the last byte of the start code. The advanced code range assignment has this format:

<start code>-<end code> in <minimum code>-<maximum code> <first unicode>

All the codes must be in the same range. This assignment increments the code from the start code to the end code, but when a byte would go past its value in the maximum code, it wraps around to its value in the minimum code and the preceding byte is incremented. The Unicode value is incremented together with the code. An example: <0103>-<0301> in <0101>-<0304> <E000>. This means the following assignments:

  • <0103> -> <E000>,
  • <0104> -> <E001>,
  • <0201> -> <E002>,
  • <0202> -> <E003>,
  • <0203> -> <E004>,
  • <0204> -> <E005>,
  • <0301> -> <E006>.
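The wrap-around rule can be sketched as an odometer in which every byte has its own lower and upper bound. This is an illustration under that literal reading, not the RecAPI implementation: next_code and expand_advanced are hypothetical names, and the arguments are hexadecimal digit strings without the angle brackets. Applied to <0103>-<0301> in <0101>-<0304> <E000>, this reading produces seven assignments and visits <0104> between <0103> and <0201>.

```python
def next_code(code, low, high):
    """Increment code by one step; a byte at high[i] wraps to low[i] and carries."""
    out = bytearray(code)
    for i in reversed(range(len(out))):
        if out[i] < high[i]:
            out[i] += 1
            return bytes(out)
        out[i] = low[i]  # wrap this byte, then carry into the preceding byte
    raise OverflowError("code ran past the maximum code")


def expand_advanced(start, end, low, high, first_unicode):
    """List the (code, unicode) pairs of an advanced code range assignment."""
    code, stop = bytes.fromhex(start), bytes.fromhex(end)
    lo, hi = bytes.fromhex(low), bytes.fromhex(high)
    text = bytes.fromhex(first_unicode).decode("utf-16-be")
    pairs = []
    while True:
        pairs.append((code.hex().upper(), text.encode("utf-16-be").hex().upper()))
        if code == stop:
            return pairs
        code = next_code(code, lo, hi)
        # Only the last code point of the Unicode string is incremented.
        text = text[:-1] + chr(ord(text[-1]) + 1)


pairs = expand_advanced("0103", "0301", "0101", "0304", "E000")
```

If the end code is never reached, next_code raises OverflowError instead of looping forever, which is a design choice of this sketch rather than documented behaviour.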

You can use Unicode strings of more than one character in a code range assignment, but only the last 16-bit code unit or surrogate pair is incremented.
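A simple code range assignment with a longer Unicode string could expand as in the sketch below. The function expand_simple is a hypothetical helper, the arguments are hexadecimal digit strings without the angle brackets, and the sample values are invented for illustration. Decoding the string as UTF-16BE turns a trailing surrogate pair into one code point, so incrementing the last code point covers both cases named above.

```python
def expand_simple(start, end, first_unicode):
    """List the (code, unicode) pairs of a simple code range assignment."""
    s, e = bytes.fromhex(start), bytes.fromhex(end)
    # The codes may differ only in the last byte, and end must not precede start.
    if len(s) != len(e) or s[:-1] != e[:-1] or e[-1] < s[-1]:
        raise ValueError("start and end code may differ only in the last byte")
    text = bytes.fromhex(first_unicode).decode("utf-16-be")
    pairs = []
    for offset, last in enumerate(range(s[-1], e[-1] + 1)):
        code = s[:-1] + bytes([last])
        # Only the final code point moves; the leading part stays fixed.
        value = text[:-1] + chr(ord(text[-1]) + offset)
        pairs.append((code.hex().upper(), value.encode("utf-16-be").hex().upper()))
    return pairs


# Hypothetical input: U+0041 followed by the surrogate pair for U+E0000.
pairs = expand_simple("0510", "0512", "0041DB40DC00")
```

In this example the leading 0041 is repeated unchanged in every assignment, while the surrogate pair advances from DB40DC00 to DB40DC01 and DB40DC02.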