RecAPI
Module name: | RER |
Module identifier: | RM_RER |
Filling methods supported: | FM_HANDPRINT, FM_CMC7, FM_OCRA, FM_OCRB, FM_MICR, FM_OMNIFONT (Thai, Vietnamese, Hebrew) |
Filters supported: | all filter elements |
Trade-off supported: | TO_ACCURATE, TO_FAST (includes TO_BALANCED) |
Knowledge base file: | kadmos.uk, hand_s.rec, numplus.rec, and the language-specific knowledge base files listed below |
Knowledge base file for Thai OCR: | kadmos.uk, ttf_s_th.rec |
Knowledge base file for Hebrew OCR: | kadmos.uk, ttf_s_il.rec |
Knowledge base file for Vietnamese OCR: | kadmos.uk, ttf_s_vn.rec |
Training file supported: | no |
This module is supported on: Windows, Linux, Mac OS X.
This module is included only in the Professional Recognition Kit (not the OCR kit). To make this technology available in your application, it must be covered by your distribution licensing.
Thai, Vietnamese and Hebrew OCR can be purchased as an add-on ("Asian Plus") to either the Professional Recognition Kit or the Professional OCR Kit.
See the topic on Licensing in the General Information help system.
This is a third-party recognition module from reRecognition GmbH, Germany. The Engine contains its recognition engine version 6.0k.
This recognition module can be used for recognition of handprinted alphanumerical characters, i.e. upper and lower case letters, the digits and some others. Although it can be used to read flowing text, its main application area is in form-like situations, where the form designer has great control over the content, and possibly the length, of the handprinted information given in each zone.
In addition this module recognizes Thai, Vietnamese and Hebrew text. It can handle short embedded English texts within such language text. Thai language is accessible from version 19.0, Hebrew from 20.1, Vietnamese from 20.2. See details below.
When the filling method FM_HANDPRINT is selected, this module can differentiate 159 characters. These are the digits, 28 punctuation and miscellaneous characters (listed below), the letters of the English alphabet, and all accented characters necessary for 98 languages. Fifteen languages have dictionary support: Catalan, Czech, Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Polish, Portuguese, Spanish and Swedish. Other supported languages include Croatian (with one limitation), Estonian, Gaelic, Indonesian, Latvian, Lithuanian, Slovak, Slovenian, Swahili, Tagalog, Turkish and Welsh (the last two with minor limitations). Cyrillic languages and Greek are not supported. In Hungarian, the lower case characters "Small I Acute", "Small O Acute" and "Small U Acute" are not supported, in effect limiting recognition to upper case characters. These languages can be freely combined, but then dictionary support is not available.
The following punctuation characters can be recognized:
! | Exclamation Mark |
? | Question Mark |
‘ | Apostrophe-Quote |
" | Quotation Mark |
; | Semicolon |
, | Comma |
: | Colon |
. | Period (Full-stop) |
- | Hyphen-Minus |
( | Opening Parenthesis |
) | Closing Parenthesis |
[ | Opening Square Bracket |
] | Closing Square Bracket |
{ | Opening Curly Bracket |
} | Closing Curly Bracket |
The following miscellaneous characters can be recognized:
# | Number Sign |
% | Percent Sign |
@ | Commercial At |
& | Ampersand |
| | Vertical Bar |
$ | Dollar Sign |
* | Asterisk |
+ | Plus Sign |
= | Equals Sign |
_ | Spacing Underscore |
/ | Slash |
\ | Backslash |
< | Less-Than Sign |
> | Greater-Than Sign |
The other supported filling methods add further character ranges to the capabilities of the RER engine. These ranges are described in OCR special filling methods and in the summary table of OCR Special Characters.
The compulsory knowledge base file is kadmos.uk. The other files, with the .rec extension, are optional: they can be removed, selected and combined with each other manually. Of these, only the general knowledge base file hand_s.rec is installed with the module during installation of OmniPage Capture SDK v20. The remainder are found only in the folder RER_KBFILES of the OmniPage CSDK install CD. (The file hand_s.rec is also there.) The file numplus.rec contains knowledge only about numbers and some miscellaneous characters. Language-specific knowledge base files are also distributed, as listed in the table below. These files have names of the form hand_s_??.rec, where the double question mark within the filename should be replaced by a country code as follows:
Code | Language(s) / Territory |
al | Albanian |
at | Austrian, German |
be | Belgian, Dutch, French, German |
ch | Swiss, French, German, Italian |
cs | Czech, Slovakian |
cz | Czech |
de | German |
dk | Danish |
ee | Estonian |
es | Spanish |
eu | West-European |
fi | Finnish |
fr | French |
hu | Hungarian |
ie | Irish, English, Gaelic Irish |
it | Italian |
lt | Lithuanian |
lv | Latvian |
nl | Dutch |
no | Norwegian |
pl | Polish |
pt | Portuguese |
ro | Romanian |
se | Swedish |
sf | Scandinavia |
sl | Slovenian |
sk | Slovakian |
tr | Turkish |
uk | UK |
us | USA |
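The naming rule above can be sketched as a small helper. This is only an illustration in Python and is not part of the CSDK: the names kb_filename and KB_COUNTRY_CODES are hypothetical, and the codes are copied from the table above.

```python
# Country codes copied from the table above.
KB_COUNTRY_CODES = {
    "al", "at", "be", "ch", "cs", "cz", "de", "dk", "ee", "es",
    "eu", "fi", "fr", "hu", "ie", "it", "lt", "lv", "nl", "no",
    "pl", "pt", "ro", "se", "sf", "sl", "sk", "tr", "uk", "us",
}


def kb_filename(code: str) -> str:
    """Return the language-specific knowledge base file name for a code."""
    normalized = code.strip().lower()
    if normalized not in KB_COUNTRY_CODES:
        raise ValueError(f"no language-specific knowledge base file for {code!r}")
    # Replace the '??' placeholder in hand_s_??.rec with the country code.
    return f"hand_s_{normalized}.rec"
```

For example, kb_filename("de") yields hand_s_de.rec, while an unlisted code raises an error instead of producing a file name that does not exist on the install CD.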
Using optional knowledge base file(s) may improve accuracy. Any subset of them can be simply copied manually into the Engine Binary directory before initiating the Engine. Although the system automatically identifies which knowledge base file is needed for a given situation (e.g. according to the language), recognition speed can be improved by minimizing the number of knowledge base files in the Engine Binary directory.
The module requires at least one .REC file in the Engine Binary directory. This file need not be HAND.REC. On the other hand, the Redistribution Wizard of the CSDK tries to copy only HAND.REC from the binary folder into the selected file set (and sends a message if this file is not there). Thus, if you want a different subset of optional knowledge base files in your redistributed file set, you should select and copy it manually after running the Redistribution Wizard.
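The manual copy step described above could be scripted along the following lines. This is a minimal sketch, not a CSDK tool: the function name add_kb_files and its arguments are hypothetical, and the source and target folders are whatever your own build uses.

```python
import shutil
from pathlib import Path


def add_kb_files(source_dir, target_dir, rec_names):
    """Copy selected .rec files into a redistribution folder.

    Returns the names copied. Raises if the target would end up with no
    .rec file at all, since the module needs at least one.
    """
    source = Path(source_dir)
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)

    copied = []
    for name in rec_names:
        src = source / name
        if not src.is_file():
            raise FileNotFoundError(f"knowledge base file not found: {src}")
        shutil.copy2(src, target / name)
        copied.append(name)

    # Case-insensitive check, because the text writes both .rec and .REC.
    has_rec = any(
        p.suffix.lower() == ".rec" for p in target.iterdir() if p.is_file()
    )
    if not has_rec:
        raise RuntimeError("target folder contains no .rec knowledge base file")
    return copied
```

Running this after the Redistribution Wizard leaves the Wizard's own file set untouched and only adds the knowledge base files you name.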
Handprint is much harder to recognize accurately than machine generated text, and success depends very heavily on character quality. The use of structured forms to limit the possible range of characters, together with zone-level filters and individual character validation can significantly improve accuracy. This recognition module can apply all the Engine’s possible filter elements to the 159-member character set it supports. Handprinted forms are usually filled by different respondents and this is liable to lower accuracy. If respondents can be given clear filling instructions (e.g. a print model to follow) and be motivated to print clearly, success will be higher.
If the handprint contains numbers only, using the RM_HNR module is likely to give better results than the RM_RER module filtered for numbers only. The functioning of the module can be influenced by the page-level trade-off settings: TO_ACCURATE is respected, while TO_FAST and TO_BALANCED are merged.
For successful recognition, the characters should not touch each other. Each character can be zoned individually, or a zone may contain one or more lines of characters. Each character must have a height of 30-180 pixels. Well-formed characters written in pen are best recognized. Pencil and felt-tip pens give poorer results. When reading from pre-printed forms, dropout colored boxes can be useful to encourage respondents to write characters of even size and spacing. In that case, respondents must not use a pen of the dropout color.
Maximum number of characters in a line: 200.
Number of lines in a zone: No restriction.
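The size and length limits above can be checked before a zone is sent for recognition. The following is a minimal sketch, assuming the character heights have already been measured by some other means; check_line and the constant names are hypothetical and not part of the CSDK.

```python
MIN_CHAR_HEIGHT_PX = 30
MAX_CHAR_HEIGHT_PX = 180
MAX_CHARS_PER_LINE = 200


def check_line(char_heights_px):
    """Return a list of problems for one line of handprinted characters.

    char_heights_px holds the pixel height of each character in the line.
    An empty list means the line is within the documented limits.
    """
    problems = []
    if len(char_heights_px) > MAX_CHARS_PER_LINE:
        problems.append(
            f"line has {len(char_heights_px)} characters; "
            f"the maximum is {MAX_CHARS_PER_LINE}"
        )
    for index, height in enumerate(char_heights_px):
        if not MIN_CHAR_HEIGHT_PX <= height <= MAX_CHAR_HEIGHT_PX:
            problems.append(
                f"character {index} is {height} px high; expected "
                f"{MIN_CHAR_HEIGHT_PX}-{MAX_CHAR_HEIGHT_PX} px"
            )
    return problems
```

A line whose characters are too small can often be fixed by rescanning at a higher resolution, so reporting the offending heights is more useful than a plain pass/fail result.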
The Engine cannot provide access to all the parameters of reRecognition’s KADMOS toolkit. Note, however, that the recognition module can be fine-tuned through parameters of an INI file located under the section [Parm]. A sample INI file, RM_RER.INI, can be found in the above-mentioned folder RER_KBFILES. The full path of the given INI file can be specified by the setting Kernel.Ocr.RER.UseParamFile, which replaces the function RecSetRMSpecParams of the previous CSDK versions.
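Before passing a path to Kernel.Ocr.RER.UseParamFile, it can help to confirm that the file exists and has a [Parm] section. The sketch below does only that. It assumes the file uses plain key=value INI syntax that Python's configparser can read, and the function name checked_param_file is hypothetical. It does not call the CSDK and does not know which parameter names the module accepts.

```python
import configparser
from pathlib import Path


def checked_param_file(ini_path):
    """Return the absolute path of an INI file that has a [Parm] section."""
    path = Path(ini_path).resolve()
    if not path.is_file():
        raise FileNotFoundError(f"INI file not found: {path}")

    # interpolation=None keeps '%' characters in values from being interpreted.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(path, encoding="utf-8")
    if not parser.has_section("Parm"):
        raise ValueError(f"{path} has no [Parm] section")
    return str(path)
```

The returned string is a full path, which is the form the setting expects according to the paragraph above.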
For Thai, Vietnamese and Hebrew, the RER recognition module can recognize only machine-printed (FM_OMNIFONT) characters. Handprinted characters are not supported for these languages.
For recognition of such text the given language should be set (LANG_THA, LANG_VIE, LANG_HEB) and Western languages should not be set (except English in one case - see next paragraph).
The module can recognize short English texts embedded in such language text. Embedded texts in other Latin-alphabet languages can also be recognized, although accented characters may not always be handled correctly. English language MUST be set for embedded text recognition of any Latin-alphabet language. (See also CCJK and Arabic language handling details.)
IMPORTANT NOTE: For the correct working of the recognition of these languages, the language should be set before the preprocess operation.
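The language-selection rules above can be summarized in a small helper that decides which languages to enable before the preprocess operation. This is a sketch only: it works on identifier names as strings and does not call the CSDK, the function name languages_to_enable is hypothetical, and LANG_ENG is assumed to be the identifier for English because the text does not name it.

```python
NON_LATIN_LANGUAGES = {"LANG_THA", "LANG_VIE", "LANG_HEB"}


def languages_to_enable(main_language, embedded_latin_text=False):
    """Return the language identifiers to enable before preprocessing."""
    if main_language not in NON_LATIN_LANGUAGES:
        raise ValueError(f"{main_language} is not Thai, Vietnamese or Hebrew")
    selected = [main_language]
    if embedded_latin_text:
        # English is the only Western language that may be enabled, and it
        # is required for any embedded Latin-alphabet text.
        selected.append("LANG_ENG")  # assumed identifier, not named in the text
    return selected
```

Keeping the decision in one place makes it easier to ensure that the resulting language set is applied before preprocessing starts, as the note above requires.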