RecAPI
Simple output converters.
Enumerations
enum DTXTOUTPUTFORMATS { DTXT_TXTS, DTXT_TXTCSV, DTXT_TXTF, DTXT_PDFIOT, DTXT_XMLCOORD, DTXT_BINARY, DTXT_IOTPDF, DTXT_IOTPDF_MRC }
    DTXT output formats.
Functions
RECERR RECAPIKRN kRecSetDTXTFormat (int sid, DTXTOUTPUTFORMATS dFormat)
    Changing DTXT format.
RECERR RECAPIKRN kRecGetDTXTFormat (int sid, DTXTOUTPUTFORMATS *pdFormat)
    Getting DTXT format.
RECERR RECAPIKRN kRecConvert2DTXT (int sid, const HPAGE *ahPage, int nPage, LPCTSTR pFilename)
    Converting pages with DTXT.
RECERR RECAPIKRN kRecConvert2DTXTEx (int sid, const HPAGE *ahPage, int nPage, IMAGEINDEX iiImg, LPCTSTR pFilename)
    Converting pages with DTXT.
RECERR RECAPIKRN kRecMakePagesSearchable (int sid, LPCTSTR pFilename, int fromPage, const HPAGE *ahPage, int nPage, IMAGEINDEX iiImg)
    Making a PDF page searchable.
Simple output converters.
This module converts recognized text simply and quickly: it uses the output of the recognition module as is, without reading order and paragraph detection. The DirectTXT outputs are therefore simpler than the Layout Retention Output conversions (available in RecAPIPlus) and also faster to produce, because they do not include slow detection processes.
There are two functions for starting the DTXT conversion. The older function, kRecConvert2DTXT, can create all the possible DTXT formats. Its newer successor, kRecConvert2DTXTEx, has an additional IMAGEINDEX parameter that controls the orientation of the pages when creating a PDF file. This latter function can also create all DTXT formats.
For existing PDF files a special conversion method can be applied. The function kRecMakePagesSearchable inserts invisible text (coming from a recognition step) into the PDF file (in-place), i.e. it makes the file searchable.
You can control DirectTXT output behavior through various settings. The root of the DirectTXT settings is Kernel.DTxt. These settings can be queried and modified through the Settings Manager Module. The following DirectTXT output types can be selected by calling kRecSetDTXTFormat:
The code page used when generating DTXT_TXT* output files can be specified by the setting Kernel.Chr.CodePage or by the function kRecSetCodePage.
The DirectTXT Text (DTXT_TXTS) output is a simple text file. The settings used by this converter are as follows:
Kernel.DTxt.UnicodeFileHeader
Kernel.DTxt.IntelByteOrder
Kernel.DTxt.PageBreak
Kernel.DTxt.txt.LineBreak
Kernel.DTxt.txt.IgnoreSpaceAtEOL
Kernel.DTxt.txt.CellLineBreak
Kernel.DTxt.txt.BeginCell
Kernel.DTxt.txt.EndCell
Kernel.DTxt.txt.CellSeparator
Kernel.DTxt.txt.ZoneSeparator
The DirectTXT CSV (DTXT_TXTCSV) output is a simple format to represent tables. Microsoft Excel can read this format. The settings used by this converter are as follows:
Kernel.DTxt.UnicodeFileHeader
Kernel.DTxt.IntelByteOrder
Kernel.DTxt.PageBreak
Kernel.DTxt.csv.EndOfRecord
Kernel.DTxt.csv.BeginField
Kernel.DTxt.csv.EndField
Kernel.DTxt.csv.FieldSeparator
Kernel.DTxt.csv.EndOfLineAsFieldSeparator
Kernel.DTxt.csv.EndOfCellLineAsFieldSeparator
Kernel.DTxt.csv.RecordSeparator
When you want to process forms, you can collect data into one row for each page. E.g. Kernel.DTxt.PageBreak = ""; Kernel.DTxt.csv.RecordSeparator = 2
The DirectTXT Formatted Text (DTXT_TXTF) delivers plain text, but attempts to keep layout as detected in the original image: this creates a text file that simulates columns and boxes using tabulators. The settings used by this converter are as follows:
Kernel.DTxt.UnicodeFileHeader
Kernel.DTxt.IntelByteOrder
Kernel.DTxt.PageBreak
Kernel.DTxt.txt.LineBreak
The newer DirectTXT PDF formats (DTXT_IOTPDF and DTXT_IOTPDF_MRC) (supported on: Windows, Linux, Embedded Linux, Mac OS X) contain the whole image of the original page and the text behind the image on a separate layer. These PDF files are especially suited to page archiving, because they contain both the image and the searchable recognized text. The quality of the generated PDF file can be adjusted. For details see the section about newer image formats.
It is recommended to use these newer formats instead of the deprecated one (DTXT_PDFIOT).
See also PDF Files in CSDK 20.
DTXT_PDFIOT is now deprecated. It can still be used, but DTXT_IOTPDF and DTXT_IOTPDF_MRC are recommended instead.
The DirectTXT PDF (DTXT_PDFIOT) output (supported on: Windows, Linux, Embedded Linux, Mac OS X) contains the whole image of the original page and the text behind the image on a separate layer. These PDF files are especially suited to page archiving, because they contain both the image and the searchable recognized text. The following settings affect the output:
Kernel.DTxt.PDF.BWFormat
Kernel.DTxt.PDF.ColorFormat
Kernel.DTxt.PDF.BWQuality
Kernel.DTxt.PDF.ColorQuality
Kernel.DTxt.PDF.BWMaxDPI
Kernel.DTxt.PDF.ColorMaxDPI
Kernel.DTxt.PDF.CompressContentStream
Kernel.DTxt.PDF.Linearized
Kernel.DTxt.PDF.PDFA
See also PDF Files in CSDK 20.
The DirectTXT XML (DTXT_XMLCOORD) output is typically used for further processing of recognized data. You can easily parse (e.g. with MSXML) or transform (with XSLT) the output XML file. The format of the XML output is specified by the same schema as the Layout Retention XML Output (http://www.nuance.com/omnipage/xml/ssdoc-schema3.xsd or local <CSDK>bin/ssdoc-schema3.xsd). The settings used by this converter are as follows:
Kernel.DTxt.xml.XSD
Kernel.DTxt.xml.InsertCharacters
Kernel.DTxt.txt.SchemaLocation
Kernel.DTxt.txt.Title
Kernel.DTxt.txt.Subject
Kernel.DTxt.txt.Author
Kernel.DTxt.txt.Company
Kernel.DTxt.txt.Comment
The DirectTXT Binary (DTXT_BINARY) output is used for creating files directly from the LETTER array (i.e. the recognition result) without any character conversion or formatting. It is designed for barcodes containing binary data (for example BAR_C128 or BAR_PDF417 barcodes containing encrypted data). This method does not perform code page conversion. Lines do not end with spaces unless those spaces are themselves part of the binary data (i.e. this method removes the zero-width end-of-line spaces). When using this output method, the barcode modules must also be forced to generate a binary result, which can be specified by the setting Kernel.OcrMgr.BarBinary. See the topics Binary output, The settings of the BAR Recognition Engine Module.
When you specify the name of an already existing file, the output of every type except DTXT_BINARY is appended to that file. You can set and get the output format using kRecSetDTXTFormat and kRecGetDTXTFormat. To perform the conversion, use kRecConvert2DTXT or kRecConvert2DTXTEx.
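The set, convert and get calls named above can be combined as in the following minimal sketch. It is not taken from the CSDK samples: the type definitions, the REC_OK value and the stub function bodies are stand-in assumptions so the code compiles without the SDK, and convert_as is a hypothetical helper name. Only the function names and parameter order come from this reference.

```c
#include <assert.h>
#include <stddef.h>

/* ---- Stand-ins (assumptions) so the sketch compiles without the CSDK.
   Remove this section and include the CSDK header in a real build. ---- */
typedef int RECERR;
typedef void *HPAGE;
typedef const char *LPCTSTR;
#define REC_OK 0  /* assumed success code */
typedef enum {
    DTXT_TXTS, DTXT_TXTCSV, DTXT_TXTF, DTXT_PDFIOT,
    DTXT_XMLCOORD, DTXT_BINARY, DTXT_IOTPDF, DTXT_IOTPDF_MRC
} DTXTOUTPUTFORMATS;

static DTXTOUTPUTFORMATS g_format = DTXT_TXTS;     /* stub: current setting */
static DTXTOUTPUTFORMATS g_last_used = DTXT_TXTS;  /* stub: format seen by the converter */

static RECERR kRecSetDTXTFormat(int sid, DTXTOUTPUTFORMATS dFormat)
{ (void)sid; g_format = dFormat; return REC_OK; }

static RECERR kRecGetDTXTFormat(int sid, DTXTOUTPUTFORMATS *pdFormat)
{ (void)sid; *pdFormat = g_format; return REC_OK; }

static RECERR kRecConvert2DTXT(int sid, const HPAGE *ahPage, int nPage, LPCTSTR pFilename)
{ (void)sid; (void)ahPage; (void)nPage; (void)pFilename; g_last_used = g_format; return REC_OK; }
/* ---- End of stand-ins. ---- */

/* Hypothetical helper: convert with a given format, then restore the
   format that was set before the call. */
RECERR convert_as(int sid, DTXTOUTPUTFORMATS fmt,
                  const HPAGE *ahPage, int nPage, LPCTSTR pFilename)
{
    DTXTOUTPUTFORMATS saved;
    RECERR rc, rcRestore;

    assert(ahPage != NULL && nPage > 0);

    rc = kRecGetDTXTFormat(sid, &saved);       /* remember the current format */
    if (rc != REC_OK) return rc;
    rc = kRecSetDTXTFormat(sid, fmt);          /* select the requested format */
    if (rc != REC_OK) return rc;

    rc = kRecConvert2DTXT(sid, ahPage, nPage, pFilename);
    rcRestore = kRecSetDTXTFormat(sid, saved); /* restore even if conversion failed */
    return rc != REC_OK ? rc : rcRestore;
}
```

Restoring the previous format is a design choice of this sketch, not a requirement stated by the reference; it keeps a shared Settings Collection unchanged for other callers.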
The DTXT module is especially useful for applications that do not require formatting but for which speed is important, for example indexing, archiving, or form processing applications.
For a collected list of the DTXT settings see the Settings of the Direct TXT Module.
enum DTXTOUTPUTFORMATS
DTXT output formats.
The following output formats can be created by the Direct TXT output converter. All of them except DTXT_BINARY are appendable. Some of the selectable output formats can be tuned by settings. See Settings of the Direct TXT Module.
DTXT_TXTS | Text Standard. |
DTXT_TXTCSV | Text CSV. |
DTXT_TXTF | Text Formatted. |
DTXT_PDFIOT | Deprecated (see usage of new formats). PDF Image on Text. Supported on: Windows, Linux, Embedded Linux, Mac OS X. |
DTXT_XMLCOORD | XML Simple. |
DTXT_BINARY | Binary output. |
DTXT_IOTPDF | Image on Text PDF with changeable compression level. Supported on: Windows, Linux, Embedded Linux, Mac OS X. |
DTXT_IOTPDF_MRC | Image on Text PDF with MRC technology. Supported on: Windows, Linux, Embedded Linux, Mac OS X. |
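The properties listed for the enumerators can be captured in a few small predicates. The sketch below mirrors the enumeration locally so that it compiles without the CSDK headers; the enumerator order is copied from the declaration above, while the helper names (dtxt_is_appendable, dtxt_is_pdf, dtxt_is_deprecated) are hypothetical and not part of RecAPI.

```c
#include <assert.h>
#include <stdbool.h>

/* Local mirror of DTXTOUTPUTFORMATS (assumption: same order as in the
   reference). In a real build, use the declaration from the CSDK header. */
typedef enum {
    DTXT_TXTS, DTXT_TXTCSV, DTXT_TXTF, DTXT_PDFIOT,
    DTXT_XMLCOORD, DTXT_BINARY, DTXT_IOTPDF, DTXT_IOTPDF_MRC
} DTXTOUTPUTFORMATS;

/* Hypothetical helper: true if converting into an existing file appends
   to it, which holds for every format except DTXT_BINARY. */
bool dtxt_is_appendable(DTXTOUTPUTFORMATS fmt)
{
    assert(fmt <= DTXT_IOTPDF_MRC);  /* reject values outside the enumeration */
    return fmt != DTXT_BINARY;
}

/* Hypothetical helper: true for the three image-on-text PDF formats. */
bool dtxt_is_pdf(DTXTOUTPUTFORMATS fmt)
{
    return fmt == DTXT_PDFIOT || fmt == DTXT_IOTPDF || fmt == DTXT_IOTPDF_MRC;
}

/* Hypothetical helper: true only for the deprecated PDF format, for which
   DTXT_IOTPDF or DTXT_IOTPDF_MRC is recommended instead. */
bool dtxt_is_deprecated(DTXTOUTPUTFORMATS fmt)
{
    return fmt == DTXT_PDFIOT;
}
```

A caller could, for instance, check dtxt_is_appendable before deciding whether several conversion calls may safely target the same output file.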
RECERR RECAPIKRN kRecConvert2DTXT (int sid, const HPAGE *ahPage, int nPage, LPCTSTR pFilename)
Converting pages with DTXT.
This function converts the given pages using the Direct TXT output converter.
[in] | sid | Settings Collection ID. |
[in] | ahPage | Array of HPAGEs to be converted. |
[in] | nPage | Number of HPAGEs in ahPage. |
[in] | pFilename | File name of the resulting file. |
Return value: RECERR
HPAGEs may be rather large memory areas, so keeping many of them in memory simultaneously may cause memory errors. All the DTXT types except DTXT_BINARY are appendable, so it is recommended to append them page by page (or a few pages at a time) to the same file instead of using a large array containing all of the HPAGEs.
If hPage contains a DataStream, this function may put the image into the output file without recompression. See the details in the section about DataStream.
The code page used when generating DTXT_TXT* output files can be specified by the setting Kernel.Chr.CodePage or by the function kRecSetCodePage.
C# signatures:
    RECERR kRecConvert2DTXT(int sid, IntPtr[] ahPage, string pFilename);
    // or
    RECERR kRecConvert2DTXT(int sid, IntPtr ahPage, string pFilename);
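The recommendation to append page by page can be sketched in C as a loop that keeps only one HPAGE alive at a time. This is an illustration under stated assumptions, not SDK sample code: the typedefs, the REC_OK value, the third parameter type of kRecRecognize and all stub bodies are stand-ins so the block compiles without the CSDK, and convert_page_by_page is a hypothetical helper. The call order (load, preprocess, recognize, free) follows the kRecMakePagesSearchable sample in this reference. The selected output format must be an appendable one, i.e. anything except DTXT_BINARY.

```c
#include <assert.h>
#include <stddef.h>

/* ---- Stand-ins (assumptions) so the sketch compiles without the CSDK.
   Remove this section and include the CSDK header in a real build. ---- */
typedef int RECERR;
typedef void *HPAGE;
typedef const char *LPCTSTR;
#define REC_OK 0  /* assumed success code */

static int g_live_pages = 0;       /* stub: HPAGEs currently allocated */
static int g_max_live_pages = 0;   /* stub: peak number of live HPAGEs */
static int g_converted_pages = 0;  /* stub: pages passed to the converter */

static RECERR kRecLoadImgF(int sid, LPCTSTR pFilename, HPAGE *phPage, int iPage)
{
    static int dummy_page;
    (void)sid; (void)pFilename; (void)iPage;
    *phPage = &dummy_page;
    if (++g_live_pages > g_max_live_pages) g_max_live_pages = g_live_pages;
    return REC_OK;
}
static RECERR kRecPreprocessImg(int sid, HPAGE hPage)
{ (void)sid; (void)hPage; return REC_OK; }
static RECERR kRecRecognize(int sid, HPAGE hPage, LPCTSTR pFilename)
{ (void)sid; (void)hPage; (void)pFilename; return REC_OK; }
static RECERR kRecConvert2DTXT(int sid, const HPAGE *ahPage, int nPage, LPCTSTR pFilename)
{ (void)sid; (void)ahPage; (void)pFilename; g_converted_pages += nPage; return REC_OK; }
static RECERR kRecFreeImg(HPAGE hPage)
{ (void)hPage; g_live_pages--; return REC_OK; }
/* ---- End of stand-ins. ---- */

/* Hypothetical helper: recognize nPages pages of pImgFile one at a time
   and append each of them to pOutFile, freeing the page before loading
   the next one. */
RECERR convert_page_by_page(int sid, LPCTSTR pImgFile, int nPages, LPCTSTR pOutFile)
{
    int i;
    assert(nPages >= 0);

    for (i = 0; i < nPages; i++) {
        HPAGE hPage = NULL;
        RECERR rc, rcFree;

        rc = kRecLoadImgF(sid, pImgFile, &hPage, i);
        if (rc != REC_OK) return rc;

        rc = kRecPreprocessImg(sid, hPage);
        if (rc == REC_OK) rc = kRecRecognize(sid, hPage, NULL);
        if (rc == REC_OK) rc = kRecConvert2DTXT(sid, &hPage, 1, pOutFile); /* appends */

        rcFree = kRecFreeImg(hPage);  /* always release the page */
        if (rc != REC_OK) return rc;
        if (rcFree != REC_OK) return rcFree;
    }
    return REC_OK;
}
```

With the stubs in place, calling convert_page_by_page(0, "in.tif", 5, "out.txt") passes five pages to the converter while never holding more than one page in memory. The file names here are placeholders.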
RECERR RECAPIKRN kRecConvert2DTXTEx (int sid, const HPAGE *ahPage, int nPage, IMAGEINDEX iiImg, LPCTSTR pFilename)
Converting pages with DTXT.
This function converts the given pages using Direct TXT output converter.
[in] | sid | Settings Collection ID. |
[in] | ahPage | Array of HPAGEs to be converted. |
[in] | nPage | Number of HPAGEs in ahPage. |
[in] | iiImg | Index of the image to be saved. (II_CURRENT or II_ORIGINAL) |
[in] | pFilename | File name of the resulting file. |
Return value: RECERR
HPAGEs may be rather large memory areas, so keeping many of them in memory simultaneously may cause memory errors. All the DTXT types except DTXT_BINARY are appendable, so it is recommended to append them page by page (or a few pages at a time) to the same file instead of using a large array containing all of the HPAGEs.
If hPage contains a DataStream, this function may put the image into the output file without recompression. See the details in the section about DataStream.
The parameter iiImg specifies the image used to create the PDF file. II_ORIGINAL can be used only if the original image or DataStream is available. See also kRecSetPreserveOriginalImg and the documentation of DataStream.
The code page used when generating DTXT_TXT* output files can be specified by the setting Kernel.Chr.CodePage or by the function kRecSetCodePage.
C# signatures:
    RECERR kRecConvert2DTXTEx(int sid, IntPtr[] ahPage, IMAGEINDEX iiImg, string pFilename);
    // or
    RECERR kRecConvert2DTXTEx(int sid, IntPtr ahPage, IMAGEINDEX iiImg, string pFilename);
RECERR RECAPIKRN kRecGetDTXTFormat (int sid, DTXTOUTPUTFORMATS *pdFormat)
Getting DTXT format.
This function retrieves the Direct TXT output format.
[in] | sid | Settings Collection ID. |
[out] | pdFormat | Pointer to a variable that receives the output format. |
Return value: RECERR
C# signature:
    RECERR kRecGetDTXTFormat(int sid, out DTXTOUTPUTFORMATS pdFormat);
RECERR RECAPIKRN kRecMakePagesSearchable (int sid, LPCTSTR pFilename, int fromPage, const HPAGE *ahPage, int nPage, IMAGEINDEX iiImg)
Making a PDF page searchable.
This function writes invisible textual information into a PDF file to make it searchable/readable.
[in] | sid | Settings Collection ID. |
[in] | pFilename | Name of the file to be made searchable. |
[in] | fromPage | Zero-based index of the first page to be made searchable. The function processes nPage pages starting from fromPage. |
[in] | ahPage | Array of HPAGEs containing the searchable/textual data (comes from kRecRecognize). |
[in] | nPage | Number of HPAGEs in ahPage. |
[in] | iiImg | Index of the image to use for orienting the pages. (II_ORIGINAL or II_CURRENT) |
Return value: RECERR
The file pFilename must not be open while kRecMakePagesSearchable runs.
It is recommended to make pages searchable page by page instead of using an array with many HPAGEs. Because each open/close of the file requires considerable resources, grouping several HPAGEs in one call is supported. However, HPAGEs may be rather large memory areas, so keeping many of them in memory simultaneously may cause memory errors.
"Kernel.OcrMgr.PDF.ProcessingMode"=PDF_PM_GRAPHICS_ONLY
    HPAGE hPage;
    rc = kRecLoadImgF(sid, pFilename, &hPage, i);
    rc = kRecPreprocessImg(sid, hPage);
    rc = kRecRecognize(sid, hPage, NULL);
    rc = kRecMakePagesSearchable(sid, pFilename, i, &hPage, 1, II_CURRENT);
    rc = kRecFreeImg(hPage);

    HIMGFILE hIFile = NULL;
    HPAGE hPage[NPAGES];
    int ipage;
    // ..
    rc = kRecOpenImgFile(pFilename, &hIFile, IMGF_READ, (IMF_FORMAT)0);
    // ..
    for (ipage = 0; ipage < NPAGES; ipage++)
        rc = kRecLoadImg(sid, hIFile, hPage + ipage, first_page + ipage);
    // ..
    rc = kRecCloseImgFile(hIFile);
    // ..
    // Preprocessing must happen somewhere between load and recognize,
    // so it can also be attached to the load loop or to the recognize loop.
    for (ipage = 0; ipage < NPAGES; ipage++)
        rc = kRecPreprocessImg(sid, hPage[ipage]);
    // ..
    for (ipage = 0; ipage < NPAGES; ipage++)
        rc = kRecRecognize(sid, hPage[ipage], NULL);
    // ..
    // The last argument (iiImg) was missing in the original call.
    rc = kRecMakePagesSearchable(sid, pFilename, first_page, hPage, NPAGES, II_CURRENT);
    // ..
    for (ipage = 0; ipage < NPAGES; ipage++)
        rc = kRecFreeImg(hPage[ipage]);
C# signatures:
    RECERR kRecMakePagesSearchable(int sid, string pFilename, int fromPage, IntPtr[] ahPage, IMAGEINDEX iiImg);
    // or
    RECERR kRecMakePagesSearchable(int sid, string pFilename, int fromPage, IntPtr ahPage, IMAGEINDEX iiImg);
RECERR RECAPIKRN kRecSetDTXTFormat (int sid, DTXTOUTPUTFORMATS dFormat)
Changing DTXT format.
This function changes the Direct TXT output format setting.
[in] | sid | Settings Collection ID. |
[in] | dFormat | The output format to be set. |
Return value: RECERR
C# signature:
    RECERR kRecSetDTXTFormat(int sid, DTXTOUTPUTFORMATS dFormat);