RecAPI
One of the Engine's settings is the code page, which can be set and queried through the functions kRecSetCodePage and kRecGetCodePage, respectively. The Language, Character Set and Code Page Handling Module is responsible for handling it.
Recognized characters are stored internally in the Engine in their UNICODE representation. The current code page is taken into account in two situations: when converting a character to or from this UNICODE representation, and when converting the recognition data to the final output document. The former is done with the kRecConvertCodePage2Unicode and kRecConvertUnicode2CodePage calls.
The output conversion process performs character code conversions from UNICODE into the current code page while producing the final output document.
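The per-character conversion described above can be pictured with a small self-contained model. The sketch below is not the RecAPI and does not call it: `CodePageTable`, `makeCp1250Excerpt`, `unicodeToCodePage` and `codePageToUnicode` are hypothetical names, and the table holds only a tiny excerpt of Windows code page 1250. It shows a round trip between a UNICODE value and its single-byte code, and what happens when a character has no code in the table.

```cpp
#include <cstdint>
#include <map>

// Hypothetical model of a code page: a two-way table between
// Unicode code points and single-byte codes. Not part of RecAPI.
struct CodePageTable {
    std::map<char32_t, std::uint8_t> toByte;
    std::map<std::uint8_t, char32_t> toUnicode;

    void add(char32_t u, std::uint8_t b) {
        toByte[u] = b;
        toUnicode[b] = u;
    }
};

// Tiny excerpt of Windows code page 1250; a real table has up to 256 entries.
CodePageTable makeCp1250Excerpt() {
    CodePageTable cp;
    cp.add(U'A', 0x41);
    cp.add(U'\u00E9', 0xE9);  // e with acute
    cp.add(U'\u0151', 0xF5);  // o with double acute
    cp.add(U'\u0171', 0xFB);  // u with double acute
    return cp;
}

// Unicode -> code page byte. Returns false if the table has no code for u.
bool unicodeToCodePage(const CodePageTable& cp, char32_t u, std::uint8_t& out) {
    std::map<char32_t, std::uint8_t>::const_iterator it = cp.toByte.find(u);
    if (it == cp.toByte.end()) return false;
    out = it->second;
    return true;
}

// Code page byte -> Unicode. Returns false if the byte is not in the table.
bool codePageToUnicode(const CodePageTable& cp, std::uint8_t b, char32_t& out) {
    std::map<std::uint8_t, char32_t>::const_iterator it = cp.toUnicode.find(b);
    if (it == cp.toUnicode.end()) return false;
    out = it->second;
    return true;
}
```

In this model a failed lookup is reported to the caller; how the Engine's own conversion calls report an unmapped character is not covered here.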
The kRecGetFirstCodePage and the kRecGetNextCodePage function-pair can be used to enumerate the list of available code pages.
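A first/next function pair is typically driven by a single loop. The sketch below illustrates that idiom only: the functions in the `demo` namespace are stand-ins written for this example, since the actual signatures of kRecGetFirstCodePage and kRecGetNextCodePage are not shown in this topic, and the two names in the list are sample data.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Stand-in enumeration functions (hypothetical, NOT the RecAPI signatures).
namespace demo {
const char* const kNames[] = {"Windows ANSI", "Windows Eastern"};
const std::size_t kCount = sizeof(kNames) / sizeof(kNames[0]);
std::size_t g_cursor = 0;  // position of the next name to hand out

// Copies the current name into buf and advances; false when exhausted.
bool copyCurrent(char* buf, std::size_t len) {
    if (g_cursor >= kCount || len == 0) return false;
    std::strncpy(buf, kNames[g_cursor], len - 1);
    buf[len - 1] = '\0';
    ++g_cursor;
    return true;
}

bool getFirst(char* buf, std::size_t len) {
    g_cursor = 0;  // restart the enumeration
    return copyCurrent(buf, len);
}

bool getNext(char* buf, std::size_t len) {
    return copyCurrent(buf, len);
}
}  // namespace demo

// Collects every name: call "first" once, then "next" until it fails.
std::vector<std::string> listCodePages() {
    std::vector<std::string> names;
    char buf[64];
    for (bool ok = demo::getFirst(buf, sizeof buf); ok;
         ok = demo::getNext(buf, sizeof buf)) {
        names.push_back(buf);
    }
    return names;
}
```

With the real functions, the loop condition would test the returned error code instead of a bool.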
There can be conflicts between the set of characters validated for recognition (see the topic Defining the character set) and the code page selection: a selected code page may not support some characters. For example, if you select the Hungarian language and the current code page is Windows ANSI (code page 1252), the final output document will not contain some accented characters of that language (for example ő and ű). Use the kRecCheckCodePage function to check whether the current code page setting contains all the characters of the current Language environment (language selection, the LanguagesPlus characters), and any characters listed as FilterPlus characters. The output of kRecCheckCodePage is a string of the characters not supported by the current code page (non-supported characters).
If there are non-supported characters when output conversion is performed, the Engine tries to replace them with similarly shaped characters in the final output document. This substitution does not work in all cases; it is mainly useful for replacing non-supported accented characters with unaccented ones. The final output document will contain a missing symbol in place of characters that were recognized correctly but could be neither exported nor substituted.
The application can call kRecSetMissingSymbol to define which character from the current code page should be used to indicate a missing symbol.
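The check, substitution and missing-symbol behavior described above can be summarized in a short model. This is a sketch under stated assumptions, not the Engine's implementation: `OutputCodePage`, `unsupportedChars` and `convertForOutput` are hypothetical names, the fallback table is illustrative, and the `'?'` default is only a placeholder for this example.

```cpp
#include <map>
#include <set>
#include <string>

// Hypothetical model of an output code page. Not part of RecAPI.
struct OutputCodePage {
    std::set<char32_t> supported;           // characters the code page can encode
    std::map<char32_t, char32_t> fallback;  // similarly shaped replacements
    char32_t missingSymbol = U'?';          // placeholder default for this sketch
};

// Plays the role described for kRecCheckCodePage: returns each character
// of `needed` that the code page cannot encode, listed once.
std::u32string unsupportedChars(const OutputCodePage& cp,
                                const std::u32string& needed) {
    std::u32string out;
    for (char32_t c : needed) {
        if (cp.supported.count(c) == 0 && out.find(c) == std::u32string::npos) {
            out.push_back(c);
        }
    }
    return out;
}

// Models the output conversion: keep supported characters, otherwise try a
// supported look-alike, otherwise emit the missing symbol.
std::u32string convertForOutput(const OutputCodePage& cp,
                                const std::u32string& text) {
    std::u32string out;
    for (char32_t c : text) {
        if (cp.supported.count(c) != 0) {
            out.push_back(c);
            continue;
        }
        std::map<char32_t, char32_t>::const_iterator it = cp.fallback.find(c);
        if (it != cp.fallback.end() && cp.supported.count(it->second) != 0) {
            out.push_back(it->second);  // e.g. accented -> unaccented
        } else {
            out.push_back(cp.missingSymbol);
        }
    }
    return out;
}
```

Running the check before conversion lets an application switch to a more suitable code page rather than rely on substitution.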
Page-level processing offers only simple output converters (because of page-level requirements). The Direct TXT Output Converter Module is responsible for performing this step of the page processing. The functions kRecSetDTXTFormat and kRecGetDTXTFormat provide access to the setting that specifies the output converter. The selected output converter is one of the DTXTOUTPUTFORMATS values.
The behavior of each converter can be fine-tuned through settings.
The integrating application invokes the output conversion by calling kRecConvert2DTXT.
RECERR rc;
...
HPAGE *hPages;
HIMGFILE hIFile;
int pageCnt, i;

// Select the Hungarian language.
LANGUAGES langs[LANG_SIZE];
memset(langs, 0, sizeof(LANGUAGES)*LANG_SIZE);
langs[LANG_HUN] = LANG_ENA;
rc = kRecSetLanguages(0, langs);

// Select a code page that supports the Hungarian language.
rc = kRecSetCodePage(0, "Windows Eastern");

// Load image.
rc = kRecOpenImgFile("multipage.tif", &hIFile, IMGF_READ, (IMF_FORMAT)0);

// Get number of pages.
rc = kRecGetImgFilePageCount(hIFile, &pageCnt);

// Create an array for the pages.
hPages = new HPAGE[pageCnt];

// Cycle through the pages.
for (i = 0; i < pageCnt; i++)
{
    // Load current page.
    rc = kRecLoadImg(0, hIFile, &(hPages[i]), i);
    // Preprocess image.
    rc = kRecPreprocessImg(0, hPages[i]);
    // Recognize image.
    rc = kRecRecognize(0, hPages[i], NULL);
}

// Close file.
rc = kRecCloseImgFile(hIFile);

// Set conversion format to PDF image on text.
rc = kRecSetDTXTFormat(0, DTXT_PDFIOT);

// Convert all the pages into a PDF file.
rc = kRecConvert2DTXT(0, hPages, pageCnt, "multipage.pdf");

// Free up memory.
for (i = 0; i < pageCnt; i++)
    rc = kRecFreeImg(hPages[i]);
delete[] hPages;
...