RecAPI
Character Set in the Engine

The Character Set

The Character Set is the group of validated characters for a given zone. It can vary from zone to zone. The recognition module associated with the zone queries the Character Set assigned to the zone immediately before the recognition step.

The Character Set concept incorporated in the Engine applies to the following text recognition modules: MOR, MTX, FRX, PLUS2W and PLUS3W, DOT, HNR and RER.

The purpose of limiting the Character Set

You can improve text recognition accuracy by narrowing the range of characters validated for recognition. The recognition module then no longer has the difficult job of choosing one solution from all 500 characters (and from multiple shapes of each character) in the Engine's Total Character Set.

Most recognition modules use this information to improve recognition; some do not. Even modules that do respond to it may return filtered-out characters, because the limited Character Set is only a hint for them.

Limiting the Character Set assumes that the application or user has some prior knowledge of what types of texts or characters will be encountered on a page or zone, for example:

  • the language(s) may be known;
  • certain character classes may be excluded (e.g. no lowercase letters);
  • there may be a limited set of permissible characters (e.g. in form-type documents).

The following steps describe how the Character Set can be defined, per image (with global settings) or per zone (with local settings).

Language Environment

Typically, texts to be recognized are written by and for people in natural languages. To describe the character components of these languages, the Engine provides two tools.

The 119 supported languages (and their combinations) can be selected directly using the LANGUAGES enum.

The language selection may not always fully meet a user’s needs:

  • Linguistic sources do not agree on the full alphabets of some languages.
  • There may be differences between archaic and modern forms or between dialects of a given language.
  • Languages transcribed from other alphabets may use different norms.
  • Some foreign words and names in a text may use accented letters not supported by the language setting.

This is why the Engine provides a second, more flexible tool: the LanguagesPlus setting, which lets you complement or construct the Language environment character by character.

This Language environment is global, i.e. it remains valid for the whole image and for all subsequent images until any of its components is changed.

Note:
The Language environment does not always equal the Character Set, since filters can be applied, either globally (per image) or locally (per zone), as described under Filtering.

Language selection (global)

This is the most frequently used tool for limiting the Character Set. You enable one or more of the 119 languages. This validates all letters and language-related characters needed for those languages, plus all digits, punctuation and miscellaneous characters.

For example, selecting German without Spanish enables the typical German letter "O diaeresis" (Ö) but disables the Spanish "Inverted Question Mark" (¿).

Related functions:

Both of these functions take an array of languages as a parameter. The enum LANGUAGES defines which array position corresponds to which language.
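The position-language relationship can be pictured with a small, self-contained sketch. It does not call the Engine: the enum below is a hypothetical three-entry stand-in for LANGUAGES (its names and positions are assumptions made for illustration), and sketch_select_languages is an illustrative helper, not an Engine function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the LANGUAGES enum: three entries with
   made-up positions. The real enum covers all supported languages. */
typedef enum {
    SKETCH_LANG_ENGLISH = 0,
    SKETCH_LANG_GERMAN,
    SKETCH_LANG_SPANISH,
    SKETCH_LANG_COUNT
} SketchLanguage;

/* Fill the per-language array so that exactly the wanted languages are
   enabled. Each array position corresponds to one enum value. */
void sketch_select_languages(bool enabled[SKETCH_LANG_COUNT],
                             const SketchLanguage *wanted, size_t n)
{
    for (size_t i = 0; i < (size_t)SKETCH_LANG_COUNT; ++i)
        enabled[i] = false;

    for (size_t i = 0; i < n; ++i) {
        assert((size_t)wanted[i] < (size_t)SKETCH_LANG_COUNT);
        enabled[wanted[i]] = true;
    }
}
```

Passing a list that contains only SKETCH_LANG_GERMAN leaves the Spanish position disabled, which mirrors the German-without-Spanish example above.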

Note:
The list of supported characters and languages depends on the recognition module applied. Only the multi-lingual modules RM_OMNIFONT_MOR, RM_OMNIFONT_PLUS2W and RM_OMNIFONT_PLUS3W support all 119 languages (20 with dictionary support). In the same notation (languages / languages with dictionary support), RM_OMNIFONT_MTX supports 12/12, RM_OMNIFONT_FRX supports 54/17, RM_DOT supports 76/14 and RM_RER supports 97/15. RM_HNR does not recognize languages. For more detail, see the topics Recognition module specifications and Languages and modules.
Only the omnifont recognition modules support all punctuation and miscellaneous characters. To see which are supported by the other modules, go to the Characters (punctuation / miscellaneous) and modules topic.
In rare cases you may define no language and build the Character Set only from individually defined characters.
For more on languages, characters, modules and Code Pages, see Introduction to language-related topics.

LanguagesPlus characters (global)

Here you define any additionally needed characters, e.g. to handle some foreign words in a text.

Related functions:

Note:
The recognition modules RM_OMNIFONT_MOR, RM_OMNIFONT_FRX, RM_OMNIFONT_PLUS2W, RM_OMNIFONT_PLUS3W, RM_DOT and RM_RER accept the LanguagesPlus characters. RM_OMNIFONT_MTX does not support LanguagesPlus characters.
To discover which accented letters are supported for each language, and which modules support them, see Languages and characters.
To revalidate individual characters removed by filtering, use FILTER_PLUS.

Filtering

Filters can be used to limit the character set defined by the Language environment to specific character categories. This filtering can be a Global filter (applied at image/page level) or a Local filter (applied per zone). FILTER_ALL switches all filtering off, enabling all the characters in the Language environment. A filter can be built up from any combination (bitwise OR-ed) of the following five disjoint elements plus a sixth special one:

These elements are rather rigid. To make filtering more flexible, the Engine provides a sixth one: FILTER_PLUS.

This additionally enables a group of individually validated characters, called the FilterPlus characters, set through the kRecSetFilterPlus function.

As an example of filtering, when your document is a questionnaire containing only capitals, you can use the filter FILTER_UPPERCASE.

Some pre-defined combined filters are available: FILTER_ALPHA for all letters, and FILTER_NUMBERS for the digits plus all FilterPlus characters.
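How OR-ed filter elements combine can be sketched as a plain bitmask. Everything below is an assumption made for illustration: the SK_ names, the bit values and the choice of the five categories (digits, uppercase, lowercase, punctuation, miscellaneous) are modelled on the character classes mentioned in this topic and are not taken from the CHR_FILTER enum, whose actual members and values should be checked in the header.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int SketchFilter;

/* Hypothetical bit values; the real CHR_FILTER enum may differ. */
enum {
    SK_DIGIT         = 1 << 0,
    SK_UPPERCASE     = 1 << 1,
    SK_LOWERCASE     = 1 << 2,
    SK_PUNCTUATION   = 1 << 3,
    SK_MISCELLANEOUS = 1 << 4,
    SK_PLUS          = 1 << 5,  /* switches the FilterPlus characters on */

    /* Combined filters in the spirit of FILTER_ALPHA and FILTER_NUMBERS. */
    SK_ALPHA   = SK_UPPERCASE | SK_LOWERCASE,
    SK_NUMBERS = SK_DIGIT | SK_PLUS
};

/* Return true if every bit of `element` is present in `filter`. */
bool sketch_filter_has(SketchFilter filter, SketchFilter element)
{
    assert(element != 0);  /* an empty element would match trivially */
    return (filter & element) == element;
}
```

Because the elements occupy separate bits, a filter such as SK_UPPERCASE | SK_DIGIT enables capitals and digits without touching any other category.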

Activation of filters

Each zone in the image has a ZONE structure defining its properties (coordinates, size, filling method, recognition module to be applied, etc.). One of the fields in this structure is the filter field.

If automatic decomposition (auto-zoning) detects the zones, this filter field will always have the value FILTER_DEFAULT, which means that for these zones a common page-level filtering, i.e. the Global filter, will be applied.

The application can change this field, or create zones with their own filter values, thereby defining Local filters.

Related functions and enums:

  • CHR_FILTER : This enum lists the disjoint filter elements plus other pre-defined combinations and a switch for FilterPlus characters.
  • kRecSetFilterPlus : Lets you set the FilterPlus characters.
  • kRecGetFilterPlus : Gets the current FilterPlus characters setting.

Note:
Remember that some recognition modules impose their own limitations, e.g. RM_HNR is limited to digits plus four symbols. Any filter for a character category can only validate the characters in that category supported by the assigned recognition module.
For the value FILTER_DEFAULT to take effect, it must be the only value in the field, i.e. it must not be combined with any other filter element.

Global filter

When the Engine is initialized, the Global filter setting takes the value FILTER_ALL (i.e. no filtering). You can set it to any other default value using the function kRecSetDefaultFilter. The Global filter setting is applied to every zone whose ZONE filter field has the value FILTER_DEFAULT.

Related functions and enums:

Local filter

As already stated, each ZONE structure has a filter field. If it holds any value other than FILTER_DEFAULT, the zone-level Local filter is used and the Global filter is ignored for that zone.
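The choice between Global and Local filter can be sketched as a single function. This is a model of the rule described above, not Engine code: the SK_ constants are hypothetical, and representing the default value as 0 is an assumption (it is consistent with the requirement that FILTER_DEFAULT be the only value in the field, but the real enum should be checked).

```c
#include <assert.h>

typedef unsigned int SketchFilter;

/* Hypothetical values; the real CHR_FILTER enum may differ. */
enum {
    SK_FILTER_DEFAULT   = 0,         /* assumed: "use the Global filter" */
    SK_FILTER_DIGIT     = 1 << 0,
    SK_FILTER_UPPERCASE = 1 << 1,
    SK_FILTER_ALL       = 0x3F       /* assumed: all six elements set */
};

/* Return the filter that applies to a zone: the Global filter when the
   zone's field holds the default value, otherwise the zone's own filter. */
SketchFilter sketch_effective_filter(SketchFilter zone_filter,
                                     SketchFilter global_filter)
{
    /* Defensive check for this sketch: no bits outside the known elements. */
    assert((zone_filter & ~(SketchFilter)SK_FILTER_ALL) == 0);

    if (zone_filter == SK_FILTER_DEFAULT)
        return global_filter;
    return zone_filter;
}
```

In this model, an auto-zoned page whose zones all carry the default value follows the Global filter, while a zone created with, say, SK_FILTER_DIGIT keeps that filter regardless of the Global setting.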

Note:
There is only a single FilterPlus set of characters. At zone level, the application can only enable or disable the use of this single set.
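The effect of sharing one FilterPlus set across zones can be illustrated with a simplified, ASCII-only model. It ignores the Language environment and the recognition module's own limits, and all names and bit values are hypothetical; it only shows that each zone's filter decides whether the one shared set is consulted.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

typedef unsigned int SketchFilter;

/* Hypothetical bit values; the real CHR_FILTER enum may differ. */
enum {
    SK_DIGIT     = 1 << 0,
    SK_UPPERCASE = 1 << 1,
    SK_LOWERCASE = 1 << 2,
    SK_PLUS      = 1 << 5   /* consult the shared FilterPlus set */
};

/* Decide whether `ch` passes `filter`. `filter_plus` is the single,
   shared set of individually validated characters. */
bool sketch_char_allowed(char ch, SketchFilter filter, const char *filter_plus)
{
    assert(filter_plus != NULL);
    unsigned char c = (unsigned char)ch;

    if ((filter & SK_DIGIT) && isdigit(c))
        return true;
    if ((filter & SK_UPPERCASE) && isupper(c))
        return true;
    if ((filter & SK_LOWERCASE) && islower(c))
        return true;
    /* strchr would also match the terminating '\0', so exclude it. */
    if ((filter & SK_PLUS) && ch != '\0' && strchr(filter_plus, ch) != NULL)
        return true;
    return false;
}
```

With a shared set of "-/", a zone filtered with SK_DIGIT | SK_PLUS accepts '-', while a zone filtered with SK_DIGIT alone rejects it. Neither zone can have a FilterPlus set of its own.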

Related functions:
These zone properties are typically defined by the kRecInsertZone or kRecUpdateZone functions.