RecAPI
MOR multi-lingual omnifont recognition module
Module name: MOR
Module identifier: RM_OMNIFONT_MOR
Filling methods supported: FM_OMNIFONT, FM_DRAFTDOT24, FM_OCRA, FM_OCRB
Filters supported: all filter elements
Trade-off supported: TO_FAST, TO_BALANCED, TO_ACCURATE
Knowledge base files: RECOGN.BCT and RECOGN24.BCT
Training file supported: yes

This recognition module is supported on: Windows, Linux.

On Windows, the PLUS2W and PLUS3W recognition modules also require the presence of this module. This module is supplied in both the Professional Recognition Kit and the OCR Kit. Its inclusion in your application must be covered by a distribution license. See the topic on Licensing in the General Information help system.

Application areas

This module recognizes machine-printed text, i.e. text from printed publications, laser or ink-jet printers, and electric typewriters. Output from mechanical typewriters in good condition may also be acceptable. It can also be used for letter quality (LQ) or near letter quality (NLQ) output from dot-matrix printers. For draft-quality 24-pin dot-matrix documents, use the FM_DRAFTDOT24 filling method; NLQ or LQ output is usually recognized better without it.

See kRecSetMorFaxed for reading standard mode (about 200 x 100 DPI) faxed documents.

This module can handle at most 500 zones defined on an image.

Range of characters

This module can recognize about 500 characters, termed the Engine's Total Character Set. It includes the letters of the Latin, Greek and Cyrillic alphabets, with enough accented letters to recognize the non-Asian languages supported by the Engine.

The set is classified as follows:

Category                             Non-accented   Accented
Latin alphabet upper case letters          26          89
Latin alphabet lower case letters          26          91
Digits                                     10           -
Punctuation                                29           -
Miscellaneous (maths symbols etc.)         55           -
Cyrillic upper case letters                33          14
Cyrillic lower case letters                33          14
Greek upper case letters                   24           9
Greek lower case letters                   25          11
OCR (OCR-A / MICR) characters               3           -

The characters are listed in category and alphanumeric order, together with their Code Page values, in Characters and Code Pages. These are the character categories used by the filter elements. The pre-trained OCR characters are: OCR Chair, OCR Hook, OCR Fork.

Character attributes

The omnifont recognition module can detect and transmit character attributes: bold, italic or underlined text (or any combination of them). It can also detect and transmit character size, and can classify font types into three broad categories: serif, sans serif and monospaced.

Speed/Accuracy choices

The multi-lingual omnifont recognition module is based primarily on contour analysis, supplemented by an innovative form of pattern matching that does not require large pre-stored shape libraries.

This module interprets all three page-level recognition trade-off settings: TO_ACCURATE, TO_BALANCED and TO_FAST.

The module is tightly integrated with the checking module, giving a total of five speed/accuracy choices.

  • Level 1: TO_FAST without checking.
    Fastest. The module reads text once and uses feature extraction only. Even this setting can give excellent accuracy on high-quality documents. Recommended also when accuracy is not a big issue (e.g. when OCR is only to allow fuzzy keyword searching in a document retrieval system) or for high-volume work when processing speed is most important.
  • Level 2: TO_FAST with checking.
    The recognition module reads text only once, with feature extraction, but sends words containing suspect or reject characters to a checker, together with its first and second guesses for unsure characters. The checker tries to find solutions based only on those characters. It also tries to repair other typical OCR faults (e.g. di9its embedded in words) and will flag all non-dictionary words it was unable to solve. Recommended e.g. when a Language dictionary is available and the texts are mono-lingual and liable to contain normal language (if not, a User dictionary could be employed).
  • Level 3: TO_BALANCED without checking.
    Two-pass recognition. During the first pass with feature extraction, the program builds up a library of sample characters and ligatured character pairs from the page, whose recognition was very sure. During the second reading pass it stops on all reject and unsure characters, consults its library and uses pattern matching to try and find solutions. That’s why the second pass is not very useful for pages with very little text – the library is too small. Recommended for multi-lingual documents or when a checker is not available.
  • Level 4: TO_BALANCED with checking.
    Two-pass recognition. Reading is a combination of the two processes used in levels 2 and 3. More accurate but processing will take more time.
  • Level 5: TO_ACCURATE with checking compulsory.
    Most accurate but slowest. Designed for use on very degraded mono-lingual documents or when maximum accuracy is very important. It involves two-pass recognition with Adaptive Cell Analysis. This is used to get a bigger library for the pattern matching: uniformly highly degraded documents typically can’t yield enough surely recognized characters to form a useful library. With ACA recognition, characters with somewhat lower certainty are accepted, provided they fall within words accepted by the checking module. This allows the pattern matching to work more successfully.
Note:
See MOR Recognition Engine Module.