RecAPI
The Character Set is the set of characters validated (enabled) for recognition in a given zone, and it can differ from zone to zone. The recognition module associated with the zone queries the Character Set assigned to the zone immediately before the recognition step.
The Character Set concept incorporated in the Engine applies to the following text recognition modules: MOR, MTX, FRX, PLUS2W and PLUS3W, DOT, HNR and RER.
You can improve text recognition accuracy by narrowing the range of characters validated for recognition. The recognition module then no longer has the difficult job of choosing one solution from all 500 characters (and from multiple shapes of every character) in the Engine's Total Character Set.
Most recognition modules use this information to improve recognition, but some do not. Even the modules that do use it may return filtered-out characters, because they treat the limited character set only as a hint.
Limiting the character set assumes that the application or user has some prior knowledge of what types of texts or characters will be encountered on a page or zone, e.g.
The following steps describe how the Character Set can be defined, per image (with global settings) or per zone (with local settings).
Typically, texts to be recognized are written by and for people in natural languages. The Engine provides two tools to describe the characters these languages use.
The 119 supported languages (and their combinations) can be selected directly using the LANGUAGES enum.
The language selection may not always fully meet a user’s needs:
This is why the Engine provides a second, more flexible tool: the LanguagesPlus setting, which lets you complement or construct the Language environment character by character.
This Language environment is global, i.e., it remains valid for the whole image and for all subsequent images until one of its components is changed.
This is the most frequently used tool for limiting the Character Set. You enable one or more of the 119 languages. This validates all letters and language-related characters needed for those languages, plus all digits, punctuation and miscellaneous characters.
For example, selecting German without Spanish enables the typical German letter "O diaeresis" (Ö) but disables the Spanish "Inverted Question Mark" (¿).
Related functions:
Both of these functions take an array of languages as a parameter. The LANGUAGES enum defines which array position belongs to which language.
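The position-language relationship can be pictured as a flag array indexed by an enum. The following is a minimal sketch in plain C, not the Engine's API: the `DemoLanguage` enum, its three members and both helper functions are hypothetical stand-ins (the real LANGUAGES enum covers 119 languages), and each language contributes only one sample character here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the LANGUAGES enum: each member's value is
   its position in the selection array. */
typedef enum {
    DEMO_LANG_ENGLISH,
    DEMO_LANG_GERMAN,
    DEMO_LANG_SPANISH,
    DEMO_LANG_COUNT
} DemoLanguage;

/* One illustrative language-specific character per language, UTF-8 encoded. */
static const char *const demo_extra_chars[DEMO_LANG_COUNT] = {
    "",          /* English: nothing beyond the basic letters */
    "\xC3\x96",  /* German: O diaeresis */
    "\xC2\xBF"   /* Spanish: inverted question mark */
};

/* Clear the selection array, then enable the n listed languages. */
static void demo_select_languages(bool selection[DEMO_LANG_COUNT],
                                  const DemoLanguage *langs, size_t n)
{
    memset(selection, 0, DEMO_LANG_COUNT * sizeof selection[0]);
    for (size_t i = 0; i < n; i++) {
        assert((int)langs[i] >= 0 && (int)langs[i] < DEMO_LANG_COUNT);
        selection[langs[i]] = true;
    }
}

/* Return true if the UTF-8 character ch is contributed by an enabled language. */
static bool demo_char_enabled(const bool selection[DEMO_LANG_COUNT],
                              const char *ch)
{
    for (int lang = 0; lang < DEMO_LANG_COUNT; lang++) {
        if (selection[lang] && strstr(demo_extra_chars[lang], ch) != NULL)
            return true;
    }
    return false;
}
```

Selecting only `DEMO_LANG_GERMAN` in this sketch enables the O diaeresis and leaves the inverted question mark disabled, mirroring the German-without-Spanish example.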
Here you define any additionally needed characters, e.g. to handle some foreign words in a text.
Related functions:
Filters can be used to limit the character set defined by the Language environment to specific character categories. A filter can be Global (applied at image/page level) or Local (applied per zone). FILTER_ALL switches all filtering off, enabling all the characters in the Language environment. A filter can be built from any combination (bitwise OR) of the following five disjoint elements plus a sixth special one:
These elements are rather rigid. To make filtering more flexible, the Engine provides a sixth one: FILTER_PLUS.
This additionally enables a group of individually validated characters, called the FilterPlus characters, set through the kRecSetFilterPlus function.
As an example of filtering, when your document is a questionnaire containing only capitals, you can use the filter FILTER_UPPERCASE.
Some pre-defined combined filters are available: FILTER_ALPHA for all letters and FILTER_NUMBERS for the digits plus all FILTER_PLUS characters.
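Combining filter elements with a bitwise OR can be illustrated with a small, self-contained sketch. Everything in it is an assumption made for illustration: the `DEMO_FILTER_*` names and bit values are hypothetical (the Engine's headers define the real constants), the five category names are inferred from the character classes mentioned in this topic, and characters are classified with the ASCII-oriented `<ctype.h>` functions, which need not match the Engine's own categories.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical flag values, for illustration only. */
enum {
    DEMO_FILTER_DIGIT         = 1 << 0,
    DEMO_FILTER_UPPERCASE     = 1 << 1,
    DEMO_FILTER_LOWERCASE     = 1 << 2,
    DEMO_FILTER_PUNCTUATION   = 1 << 3,
    DEMO_FILTER_MISCELLANEOUS = 1 << 4,
    DEMO_FILTER_PLUS          = 1 << 5,
    /* Combined filters, built as described in the text. */
    DEMO_FILTER_ALPHA   = DEMO_FILTER_UPPERCASE | DEMO_FILTER_LOWERCASE,
    DEMO_FILTER_NUMBERS = DEMO_FILTER_DIGIT | DEMO_FILTER_PLUS
};

/* Map a character to exactly one of the five disjoint categories. */
static unsigned demo_category_of(char ch)
{
    unsigned char c = (unsigned char)ch;
    if (isdigit(c)) return DEMO_FILTER_DIGIT;
    if (isupper(c)) return DEMO_FILTER_UPPERCASE;
    if (islower(c)) return DEMO_FILTER_LOWERCASE;
    if (ispunct(c)) return DEMO_FILTER_PUNCTUATION;
    return DEMO_FILTER_MISCELLANEOUS;
}

/* A character passes if its category is enabled in the filter, or if the
   PLUS element is enabled and the character is listed in plus_chars. */
static bool demo_char_passes(unsigned filter, char ch, const char *plus_chars)
{
    assert(plus_chars != NULL);
    if (filter & demo_category_of(ch))
        return true;
    return (filter & DEMO_FILTER_PLUS) != 0
        && ch != '\0'
        && strchr(plus_chars, ch) != NULL;
}
```

With `DEMO_FILTER_NUMBERS` and the plus characters `"-."`, for instance, digits, the minus sign and the period pass while letters are rejected, which is the kind of behavior a numeric form field would want.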
Each zone in the image has a ZONE structure defining its properties (coordinates, size, filling method and recognition module to be applied etc.). One of the fields in this structure is the filter field.
If automatic decomposition (auto-zoning) detects the zones, this filter field will always have the value FILTER_DEFAULT, which means that for these zones a common page-level filtering, i.e. the Global filter, will be applied.
The application can change this field, or create zones with individual filter values; such per-zone values define Local filters.
Related functions and enums:
When the Engine is initialized, the Global filter setting takes the value FILTER_ALL (i.e. no filtering). You can set it to any other default value using the function kRecSetDefaultFilter. The Global filter setting is applied to every zone whose filter field has the value FILTER_DEFAULT.
Related functions and enums:
As already stated, each ZONE structure has a filter field. If it holds any value other than FILTER_DEFAULT, that zone-level Local filter is used and the Global filter is ignored for the zone.
Related functions:
These zone properties are typically defined by the kRecInsertZone or kRecUpdateZone functions.
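The interplay of the Global and Local filters comes down to one rule: a zone's own filter wins unless it is FILTER_DEFAULT. The sketch below models only that rule in plain C. It does not call the Engine; the `Sketch*` types, the `SKETCH_FILTER_*` values and all three functions are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned SketchFilter;

/* Hypothetical values; only the roles matter for this sketch. */
#define SKETCH_FILTER_DEFAULT   0u     /* "use the Global filter" marker */
#define SKETCH_FILTER_DIGIT     0x01u
#define SKETCH_FILTER_UPPERCASE 0x02u
#define SKETCH_FILTER_ALL       0xFFu  /* no filtering */

typedef struct { SketchFilter global_filter; } SketchEngine;
typedef struct { SketchFilter filter; } SketchZone;

/* On initialization the Global filter is "no filtering". */
static void sketch_engine_init(SketchEngine *engine)
{
    assert(engine != NULL);
    engine->global_filter = SKETCH_FILTER_ALL;
}

/* Models changing the Global (page-level) filter. */
static void sketch_set_default_filter(SketchEngine *engine, SketchFilter f)
{
    assert(engine != NULL);
    engine->global_filter = f;
}

/* Local filter if the zone has one, otherwise the Global filter. */
static SketchFilter sketch_effective_filter(const SketchEngine *engine,
                                            const SketchZone *zone)
{
    assert(engine != NULL && zone != NULL);
    return zone->filter == SKETCH_FILTER_DEFAULT ? engine->global_filter
                                                 : zone->filter;
}
```

In this model an auto-detected zone (filter left at the default marker) follows every later change of the Global filter, while a zone created with its own filter value keeps that value regardless of the Global setting.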