RecAPI
Checking

The Spell Checking Module is mainly used to improve the accuracy of the recognized output text and/or to make proofing more effective. Its services are closely integrated into the recognition process: they are activated internally either during recognition (by some recognition modules) or immediately after it. For this reason, all checking-related settings must be set BEFORE the recognition process is called.

The checking subsystem consists of the following two independent parts:

  • Spell checking: using language-specific dictionary elements,
  • UD-checking: checking against the user dictionary (with words, strings).

Any combination of these parts can be used during the recognition process as steps in determining whether a word is acceptable. The most important API function for controlling the checking module is kRecSetSpell, which enables or disables the running of the checking module. Note that the individual checks can still be enabled or disabled separately at zone level. Disabling checking may speed up recognition slightly, but it reduces accuracy.

Spell checking

Spell checking uses third-party language-specific spell checkers. There are two kinds of spell checkers:

  • Language dictionaries ("language checking"), and
  • Vertical dictionaries ("vertical language checking").

The Capture SDK is delivered with over 20 different language dictionaries. These are generic language dictionaries; they typically contain between 100,000 and 200,000 entries. Vertical dictionaries cover specialized professions. In the toolkit they can be treated as extensions to the language dictionaries, though they can also be used when no language dictionary is specified. The Capture SDK is delivered with 9 different Vertical dictionaries for three professions (financial, medical and legal) in four languages (English, German, French and Dutch).

For enumerating the language and vertical dictionaries, use the function pairs kRecGetFirstSpellLanguage, kRecGetNextSpellLanguage and kRecGetFirstVerticalDict, kRecGetNextVerticalDict respectively.

Specifying the spelling language (kRecSetSpellLanguage) and/or a vertical dictionary (kRecSetVerticalDictionaries) is a global setting, although it may differ between Settings Collections. Once a language dictionary has been specified, language checking is applied to all zones on the page, except user zones with the CHK_LANGDICT_PROHIBIT flag (i.e. the checking can be disabled at zone level). Similarly, once a vertical dictionary has been specified, vertical language checking is applied to all zones on the page, except user zones with the CHK_VERTDICT_PROHIBIT flag.

When the spelling language is set to LANG_NO or LANG_UD, or when it specifies a language for which there is no language dictionary, the language checking will not be activated for the zones of the page. In the special (default) case, when the specified spelling language is set to LANG_AUTO, the language checking will be performed based on the recognition language selection (kRecSetLanguages) as follows:

  • If the recognition language selection is empty, no language checking is performed.
  • If one recognition language has been set, that language is automatically selected for language checking, provided that the appropriate language dictionary is available.
  • If two or more languages are selected for recognition, language checking is done for all selected languages for which a language dictionary is available.
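
The LANG_AUTO rules above amount to filtering the recognition languages by dictionary availability. The following is a minimal sketch of that rule, not SDK code: the DemoLang enumeration, its values, and the function demo_resolve_auto_spell_languages are hypothetical names introduced for illustration, and the has_dict table stands in for whatever the SDK knows about the installed language dictionaries.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

// Hypothetical stand-in for the SDK's language identifiers. The values are
// illustrative only and do not match the real RecAPI language enumeration.
typedef enum {
    DEMO_LANG_ENG,
    DEMO_LANG_GER,
    DEMO_LANG_DUT,
    DEMO_LANG_COUNT
} DemoLang;

// Models the LANG_AUTO rule (an assumption-based sketch, not the SDK's code).
//   selected[i] : true if language i is selected for recognition
//   has_dict[i] : true if a language dictionary exists for language i
//   out         : receives the languages used for language checking
// Returns the number of entries written to out; 0 means that no language
// checking takes place.
size_t demo_resolve_auto_spell_languages(const bool selected[DEMO_LANG_COUNT],
                                         const bool has_dict[DEMO_LANG_COUNT],
                                         DemoLang out[DEMO_LANG_COUNT])
{
    assert(selected != NULL && has_dict != NULL && out != NULL);

    size_t n = 0;
    for (int i = 0; i < DEMO_LANG_COUNT; ++i) {
        // A language is checked only if it is selected for recognition
        // AND a language dictionary is available for it.
        if (selected[i] && has_dict[i]) {
            out[n++] = (DemoLang)i;
        }
    }
    return n;
}
```

An empty selection yields 0, a single selected language yields at most one entry, and several selected languages yield every one that has a dictionary, so the three cases in the list are covered by the same loop.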
Note:
The following example shows how to improve the performance of OCR on pages containing Dutch legal text:
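    RECERR rc;
    ...
    // Use the Dutch language dictionary for language checking.
    rc = kRecSetSpellLanguage(0, LANG_DUT);
    // Add the Dutch legal vertical dictionary.
    VDICTDESC setdesc="Dutch Legal Dictionary";
    rc = kRecSetVerticalDictionaries(0, &setdesc, 1);
    ...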

UD-checking

The checking module also makes use of a user dictionary. A user dictionary is a collection of user-specific elements, called UDitems. A UDitem is a word, as in any word processor's user dictionary. A string being checked is accepted if it conforms to at least one item of the user dictionary. The user dictionary is specified with kRecSetUserDictionary.

If the application uses spell checking and consistently encounters correctly spelled words that are marked as uncertain, or if the document is known to contain many proper nouns, the application can reduce unwanted marking and improve recognition accuracy by performing UD-checking in addition to spell checking. This assumes that the user dictionary has been prepared beforehand by adding the required words to it. In this case UD-checking is complementary to spell checking.

UD-checking without spell checking enabled is typically used in form-like applications (e.g. questionnaires), i.e. where the data to be recognized is highly structured and follows predictable patterns.

For particular user zones the UD-checking can be disabled with the CHK_USERDICT_PROHIBIT flag.
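
The interaction between the global settings and the zone-level prohibit flags described in this section can be sketched as follows. This is only a model under stated assumptions: the DEMO_ constants, their bit values, and both structures are hypothetical and are not taken from the RecAPI headers. The sketch also assumes that the flags can be combined as bits and that the global switch set by kRecSetSpell takes precedence over everything else.

```c
#include <assert.h>
#include <stdbool.h>

// Hypothetical stand-ins for CHK_LANGDICT_PROHIBIT, CHK_VERTDICT_PROHIBIT and
// CHK_USERDICT_PROHIBIT. The numeric values are assumptions.
enum {
    DEMO_CHK_LANGDICT_PROHIBIT = 1 << 0,
    DEMO_CHK_VERTDICT_PROHIBIT = 1 << 1,
    DEMO_CHK_USERDICT_PROHIBIT = 1 << 2,
    DEMO_CHK_ALL_PROHIBIT      = (1 << 3) - 1
};

// Hypothetical summary of the global (Settings Collection level) state.
typedef struct {
    bool checking_enabled; // models the switch controlled by kRecSetSpell
    bool lang_dict_set;    // a usable language dictionary is selected
    bool vert_dict_set;    // a vertical dictionary is selected
    bool user_dict_set;    // a user dictionary is selected
} DemoGlobalChecking;

// Which checks are effectively applied to one zone.
typedef struct {
    bool lang;
    bool vert;
    bool user;
} DemoZoneChecking;

// A check runs in a zone only if checking is enabled globally, the matching
// dictionary is set, and the zone does not carry the matching prohibit flag.
DemoZoneChecking demo_effective_checking(DemoGlobalChecking g, unsigned zone_flags)
{
    // Reject bits that this sketch does not model.
    assert((zone_flags & ~(unsigned)DEMO_CHK_ALL_PROHIBIT) == 0u);

    DemoZoneChecking z;
    z.lang = g.checking_enabled && g.lang_dict_set &&
             !(zone_flags & DEMO_CHK_LANGDICT_PROHIBIT);
    z.vert = g.checking_enabled && g.vert_dict_set &&
             !(zone_flags & DEMO_CHK_VERTDICT_PROHIBIT);
    z.user = g.checking_enabled && g.user_dict_set &&
             !(zone_flags & DEMO_CHK_USERDICT_PROHIBIT);
    return z;
}
```

For example, a form field that should be validated only against the user dictionary would, in this model, carry both the language and the vertical prohibit flags while leaving the user dictionary flag clear.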

The checking subsystem can handle two kinds of user dictionaries: native dictionary files (created or updated by a previous kRecSaveUD call) and word-list files (text files containing one word per line).
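
A word-list file can be produced with ordinary file I/O. The sketch below uses only the standard C library; the function name demo_write_word_list and the file name in the usage note are hypothetical. The character encoding that the SDK expects for word-list files is not stated here, so the sketch simply writes the bytes it is given.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

// Writes count words to the text file at path, one word per line.
// Returns 0 on success and -1 if the file cannot be opened or written.
int demo_write_word_list(const char *path, const char *const *words, size_t count)
{
    assert(path != NULL && words != NULL);

    FILE *fp = fopen(path, "w");
    if (fp == NULL) {
        return -1;
    }

    int status = 0;
    for (size_t i = 0; i < count; ++i) {
        // One dictionary entry per line.
        if (fprintf(fp, "%s\n", words[i]) < 0) {
            status = -1;
            break;
        }
    }

    // fclose flushes buffered data, so its result matters as well.
    if (fclose(fp) != 0) {
        status = -1;
    }
    return status;
}
```

Calling demo_write_word_list("words.txt", words, 2) with the array { "Nuance", "OmniPage" } produces a two-line text file.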

Compiling and changing a user dictionary requires that the application specify a user dictionary file with the kRecSetUserDictionary function and then call the kRecOpenMaintenanceUD function. From this point the content of the user dictionary can be listed, and items can be added (kRecAddItemUD) or removed (kRecDeleteItemUD) on request until kRecCloseMaintenanceUD is called. The kRecGetUDState function can be used to learn whether there has been any change since the user dictionary was last opened for maintenance. Changes can be made permanent by calling the kRecSaveUD function before the dictionary is closed.

A new user dictionary can be created by calling the kRecSetUserDictionary function with a NULL parameter.

The UDitems of the currently opened user dictionary can be enumerated with the kRecGetFirstItemUD and kRecGetNextItemUD function pair.

The opened user dictionary must be closed before recognition.

Note:
The following code sample creates a new user dictionary.
    RECERR rc;
    ...
    // A word.
    LPCWSTR item        = L"Nuance";

    // Remove active user dictionary.
    rc = kRecSetUserDictionary(0, NULL, NULL);
    // Start user dictionary maintenance mode.
    rc = kRecOpenMaintenanceUD(0);
    // Add an item.
    rc = kRecAddItemUD(0, NULL, item, 0);
    // Save changes.
    rc = kRecSaveUD(0, "testud.dic", FALSE);
    // Stop user dictionary maintenance mode.
    rc = kRecCloseMaintenanceUD(0);
    // Use new user dictionary.
    rc = kRecSetUserDictionary(0, "testud.dic", NULL);
    ...