Nuance OmniPage Capture SDK 20 provides extensive support for Portable Document Format (PDF) files on both the input and output sides. This datasheet gives an overview of this area.
Both PDF input and output are supported on Windows, Linux and Mac OS X. In addition, PDF output is supported on Embedded Linux and Android. The same applies to PDF_MRC.
PDF Input is supplied in both the Professional Recognition Kit and the OCR Kit. The PDF Output Kit, however, is an optional add-on. For more details, see the topic on Licensing in the General Information help system.
Name | Adobe Portable Document Format |
Format ID | FF_PDF_* |
Image load (read) | Yes |
Image save (write) | Yes |
Image types supported | As the result of the image loading process, either a B/W, an 8-bit grayscale, a paletted or a 24-bit true-color image is created in the Engine. |
Multi-page supported | Yes |
Special note | Supports standard PDF files compliant up to the PDF v2.0 specification. |
PDF input is supported on: Windows, Linux and Mac OS X.
If you develop a Linux application using CSDK, see also the Linux-specific notes about PDF input.
By default (this can be changed using different settings), the program's handling of input PDF files depends on the following settings:
Kernel.Imf.DefaultDPI
Kernel.Imf.PDF.Resolution (the case where it is not 0)
Kernel.Imf.PDF.LoadOriginalDPI (the case where it is TRUE)
PDF files may be password-protected. There are two types of password: open (or user) and permissions (or owner, or master).
Open passwords can block file opening. When a PDF requires an open password, CSDK 20 cannot open it without that password, so your application must include an interface to accept one.
As for permissions passwords, CSDK 20 only checks the permissions that block printing or content copying.
When a PDF requires a permissions password for content copying, its text content cannot be copied without it. CSDK 20 can, however, process a content-copy protected file without a permissions password. In this case the encrypted PDF is treated as an image-only one and no textual information can be extracted.
A PDF may also require a permissions password for printing. CSDK 20 will only load a PDF if its printing is not blocked – that is, the user either has this permissions password, or the file is not protected against printing.
When a PDF file is protected by both an open and a permissions password, only the permissions password needs to be given for full access.
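The password rules above can be summarized as a small decision function. This is only a sketch of the described behavior, not CSDK code: the function, its parameters and the `Access` enum are hypothetical names, and the order of the checks is an assumption drawn from the paragraphs above.

```python
from enum import Enum


class Access(Enum):
    REFUSED = "refused"        # the file cannot be loaded at all
    IMAGE_ONLY = "image-only"  # pages load, but no text is extracted
    FULL = "full"              # pages and text content are available


def decide_access(needs_open, blocks_print, blocks_copy,
                  has_open=False, has_permissions=False):
    """Hypothetical model of the password rules described above."""
    # The permissions password alone grants full access, even when an
    # open password is also set on the file.
    if has_permissions:
        return Access.FULL
    # Without it, an open-protected file needs the open password.
    if needs_open and not has_open:
        return Access.REFUSED
    # A print-protected file is only loaded with the permissions password.
    if blocks_print:
        return Access.REFUSED
    # A copy-protected file is processed as an image-only PDF.
    if blocks_copy:
        return Access.IMAGE_ONLY
    return Access.FULL
```

For example, `decide_access(False, False, True)` models a copy-protected file processed without any password and yields `Access.IMAGE_ONLY`.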
CSDK 20 is able to produce PDF files on the KernelAPI, RecAPIPlus as well as on the IPRO layers.
PDF output in KernelAPI is supported on: Windows, Linux, Embedded Linux, Mac OS X.
For creating image-only PDF files, KernelAPI offers the following format choices:
FF_PDF_MIN | Minimum image file size |
FF_PDF_GOOD | Medium image file size |
FF_PDF_SUPERB | Large image file size with high quality |
FF_PDF_MRC_MIN | MRC-compressed PDF optimized for minimum file size |
FF_PDF_MRC_GOOD | MRC-compressed PDF of medium file size |
FF_PDF_MRC_SUPERB | MRC-compressed PDF providing large file size, but high quality |
In the MRC formats the image is saved in multiple layers: the background, foreground and selector layers, which are compressed using different compression algorithms.
When creating an MRC file, CSDK decomposes the image into multiple layers and sub-images. This process includes an algorithm for detecting text. It can benefit from the OCR result: no text detection is needed if the HPAGE contains an II_OCR image. Note that the II_OCR image is created during the recognition process and by default is freed after recognition. To keep it, the setting Kernel.OcrMgr.Images.KeepOcrImage must be set to TRUE before recognition.
For more on the PDF MRC format, see the section Saving MRC PDF files in KernelAPI below.
The option of creating image-only PDFs is available directly after image input.
The other PDF output available in KernelAPI is DirectTXT PDF (DTXT_PDFIOT). It contains the whole image of the original page and the text behind the image on a separate layer. These PDF files especially suit the purpose of page archiving, because they contain both the original image and recognized text.
It is recommended to set Kernel.OcrMgr.Images.KeepOcrImage to TRUE when creating a DirectTXT PDF, as mentioned previously.
DTXT PDF output also provides the different PDF qualities. However, while image-only PDF has a separate format for each quality, in DTXT the same output format is selected and settings specify the quality level and determine whether MRC is used. See the settings of DTXT image-on-text PDF output.
PDF output at the RecAPIPlus level is supported on: Windows.
In RecAPIPlus and IPRO, the following PDF output formats and converters are available:
PDF with image on text (Searchable PDF in OmniPage terminology) – A PDF converter where the original (input) image is retained in the foreground with the recognized text hidden in the background (and in the correct position). This format allows the content of an image PDF to become searchable without disrupting the original due to the hidden text layer. Text in a Searchable PDF is positioned directly behind the corresponding image text and is selectable and searchable in popular PDF viewers. This format especially suits archiving and indexing purposes.
PDF - A highly configurable, general PDF output converter. It supports many PDF features, but relies heavily on the position of the recognized characters.
PDF with image substitutes - A special PDF converter, where suspect words are covered by their images cut out of the original image.
PDF edited - This PDF converter does not rely on the position of the recognized characters, so it can be used even after large new portions of text have been inserted in the editor.
All PDF Output converters have the following features in common:
CSDK can save encrypted PDF files using either an open (or user) or a permissions (or owner, or master) password (see also the description of encrypted PDF files above). This option is fully setting-controlled. The setting Kernel.Imf.PDF.PDFSecurity.Type determines the encryption method used. The passwords can also be stored in settings (Kernel.Imf.PDF.PDFSecurity.OwnerPassword and UserPassword).
In addition, permission flags can be set on the created PDF file to enable individual operations.
The created PDF file can be opened for the enabled operations by using the open password. With the permissions password, all operations are enabled.
Existing encrypted PDF files can be modified and saved as well. In this case the password of the existing file is needed to process it. The file and the given password determine which operations can be performed, so the setting Kernel.Imf.PDF.PDFSecurity.Type has no effect. The password (either the open or the permissions one) can be specified in the setting Kernel.Imf.PDF.PDFSecurity.ProcessPassword.
There is no MRC compression for black-and-white images. If an MRC format is selected for a B/W image, the corresponding non-MRC format is used instead.
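The B/W fallback can be pictured as a simple lookup. The sketch below is illustrative only: the pairing of each MRC format with the same-named quality level is an assumption (the text does not spell out the mapping), and `effective_format` is a hypothetical helper, not an SDK function.

```python
# Assumed pairing of each MRC format with its non-MRC counterpart;
# the documentation does not list the mapping explicitly.
NON_MRC_EQUIVALENT = {
    "FF_PDF_MRC_MIN": "FF_PDF_MIN",
    "FF_PDF_MRC_GOOD": "FF_PDF_GOOD",
    "FF_PDF_MRC_SUPERB": "FF_PDF_SUPERB",
}


def effective_format(requested, image_is_bw):
    """Return the format name that would actually be written."""
    if image_is_bw:
        # MRC is not applied to B/W images; non-MRC names pass through.
        return NON_MRC_EQUIVALENT.get(requested, requested)
    return requested
```

Under this assumption, requesting `FF_PDF_MRC_GOOD` for a B/W page resolves to `FF_PDF_GOOD`, while the same request for a color page is left unchanged.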
Both image-only and image-on-text MRC PDF files can be saved. The former is saved by the Image File Handling Module, the latter by the Direct TXT Output Converter Module. The description below uses the terms of the image-only case; the corresponding terms for the image-on-text case follow naturally. For more information, see the section about PDF Output in KernelAPI above.
DTXT image-on-text PDF output also provides the different PDF qualities. However, while image-only PDF has a separate format for each quality, in DTXT the same output format is selected and settings specify the quality level and determine whether MRC is used. See the settings of DTXT image-on-text PDF output.
Compression methods:
Compression Quality:
MRC format | Selector layer (Group4) | Selector layer (JBIG2) | Foreground layer (Jpeg) | Foreground layer (Jpeg2000) | Background layer (Jpeg) | Background layer (Jpeg2000) |
FF_PDF_MRC_SUPERB | lossless | lossless | 50% | 50% | 75% | 75% |
FF_PDF_MRC_GOOD | lossless | lossy | not used | not used | 75% | 75% |
FF_PDF_MRC_MIN | lossless | lossy | not used | not used | 50% | 50% |
Resolution of the layers:
MRC Compression | Resolution of the Current Image | Resolution of the Selector Layer compared to the Current Image | Resolution of the Foreground Layer compared to the Selector Layer | Resolution of the Background Layer compared to the Selector Layer |
FF_PDF_MRC_SUPERB | 1, 1.5 or 2 depending on the current Resolution Enhancement setting (see note 1. below) | same as in case of MRC_GOOD (see note 2. below) or 1 if resolution enhancement was not applied, otherwise 1/2 (see note 3. below) | ||
FF_PDF_MRC_GOOD and FF_PDF_MRC_MIN | <= 450 DPI | 1/3 | 1/3 | |
> 450 DPI | 1/4 |
1. If the II_BW image does not exist, it will be created using the current settings (see secondary image conversion).
2. ... II_BW image, the resolution of all three layers will be the same as the resolution of the current (II_CURRENT) image.
3. If the ratio of the resolution enhancement is 2, then the resolution of the FG and BG layers will be the same as the resolution of the current image. If the ratio of the resolution enhancement is 1.5, then the resolution of the FG and BG layers will be 3/4 of the resolution of the current image.
When a document is saved in PDF formats, the image can be saved using MRC technology. In this case the same compression methods and compression quality are used, and the resolution of the layers will also be the same as described previously.
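The FF_PDF_MRC_SUPERB layer arithmetic given earlier (selector layer at 1, 1.5 or 2 times the current image; FG/BG layers at the selector resolution without enhancement, otherwise at half of it) can be checked with exact fractions. This is a worked calculation only; `superb_fg_bg_ratio` is a hypothetical name, and the sketch covers only the SUPERB case.

```python
from fractions import Fraction


def superb_fg_bg_ratio(enhancement):
    """FG/BG layer resolution relative to the current image (SUPERB case).

    enhancement is the resolution-enhancement ratio of the selector layer:
    1 (not applied), 1.5 or 2.
    """
    enhancement = Fraction(enhancement)
    if enhancement not in (Fraction(1), Fraction(3, 2), Fraction(2)):
        raise ValueError("expected an enhancement ratio of 1, 1.5 or 2")
    # FG/BG match the selector layer when no enhancement was applied,
    # otherwise they are at half of the selector resolution.
    relative_to_selector = Fraction(1) if enhancement == 1 else Fraction(1, 2)
    return enhancement * relative_to_selector
```

With a ratio of 2 the result is 1 (same as the current image), and with 1.5 it is 3/4, matching the figures stated in the text.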
The converter settings that affect the compression include Compatibility, UseMRC, Compression.UseJBIG2 and Compression.UseJPEG2000.
The following combinations can be used:
Compatibility | UseMRC | Compression.UseJBIG2 | Compression.UseJPEG2000 |
R2ID_PDF_FORCESIZE | R2ID_PDFMRC_MIN | TRUE | TRUE |
R2ID_PDF_FORCEQUALITY | R2ID_PDFMRC_NO | ||
R2ID_PDF20 | any possible value | TRUE/FALSE | TRUE/FALSE |
R2ID_PDF17 | any possible value | TRUE/FALSE | TRUE/FALSE |
R2ID_PDF16 | any possible value | TRUE/FALSE | TRUE/FALSE |
R2ID_PDF15 | any possible value | TRUE/FALSE | TRUE/FALSE |
R2ID_PDF14 | any possible value | TRUE/FALSE | FALSE |
R2ID_PDF13 | any possible value | FALSE | FALSE |
R2ID_PDF12 and below | R2ID_PDFMRC_NO | ||
R2ID_PDFA (deprecated) | any possible value | TRUE/FALSE | FALSE |
R2ID_PDFA1B (instead of R2ID_PDFA) | any possible value | TRUE/FALSE | FALSE |
R2ID_PDFA2B | any possible value | TRUE/FALSE | TRUE/FALSE |
R2ID_PDFA3B | any possible value | TRUE/FALSE | TRUE/FALSE |
R2ID_PDFA2U | any possible value | TRUE/FALSE | TRUE/FALSE |
R2ID_PDFA3U | any possible value | TRUE/FALSE | TRUE/FALSE |
R2ID_PDFA1A | any possible value | TRUE/FALSE | FALSE |
R2ID_PDFA2A | any possible value | TRUE/FALSE | TRUE/FALSE |
R2ID_PDFA3A | any possible value | TRUE/FALSE | TRUE/FALSE |
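As a reading aid, the table can be transcribed into a lookup that answers whether a given JBIG2/JPEG2000 combination is listed for a compatibility level. This is a hypothetical helper, not part of the SDK. It leaves out UseMRC and the rows whose compression cells are empty (R2ID_PDF_FORCEQUALITY, R2ID_PDF12 and below), since the table gives no values for them.

```python
# Allowed (UseJBIG2, UseJPEG2000) values per Compatibility level,
# transcribed from the table above.
EITHER = frozenset({True, False})
OFF = frozenset({False})
ON = frozenset({True})

_JBIG2_AND_JPEG2000 = [
    "R2ID_PDF20", "R2ID_PDF17", "R2ID_PDF16", "R2ID_PDF15",
    "R2ID_PDFA2B", "R2ID_PDFA3B", "R2ID_PDFA2U", "R2ID_PDFA3U",
    "R2ID_PDFA2A", "R2ID_PDFA3A",
]
_JBIG2_ONLY = ["R2ID_PDF14", "R2ID_PDFA", "R2ID_PDFA1B", "R2ID_PDFA1A"]

RULES = {name: (EITHER, EITHER) for name in _JBIG2_AND_JPEG2000}
RULES.update({name: (EITHER, OFF) for name in _JBIG2_ONLY})
RULES["R2ID_PDF13"] = (OFF, OFF)
RULES["R2ID_PDF_FORCESIZE"] = (ON, ON)


def combination_allowed(compatibility, use_jbig2, use_jpeg2000):
    """True if the table lists this flag combination for the given level."""
    if compatibility not in RULES:
        raise KeyError(f"no compression rule recorded for {compatibility}")
    jbig2_ok, jpeg2000_ok = RULES[compatibility]
    return use_jbig2 in jbig2_ok and use_jpeg2000 in jpeg2000_ok
```

For instance, `combination_allowed("R2ID_PDF14", True, True)` is false because the table lists JPEG2000 as FALSE for that level.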
The Pictures setting may modify the resolution of the layers. If the resolution of a layer is higher than the resolution specified by the Pictures setting, the layer will be transformed to the specified resolution, so it is recommended to leave this setting at its default (R2_DPI_ORIGINAL) when saving MRC PDF.
The PictureColor setting may change the bit depth of the layers, so it is recommended to leave it at its default (R2_BPP_ORIGINAL).
Generating editable output from PDF files has been sped up: more advanced technology is applied to make zoning faster, achieve higher OCR accuracy, improve output quality and make the resulting files more usable when they are edited further in target applications. This is achieved by creating two images whenever the input is a PDF or XPS file with a text layer. One is a composite image with all PDF information; the other contains only a background image without any text. This is especially useful for pages where text wraps around pictures irregularly. Further speed-up is achieved by assessing image quality and layout complexity: a faster OCR algorithm is applied to high-quality pages with simple layouts.
This technology cannot be applied to active PDF forms, and works only in accurate mode. In cases where this technology cannot be applied, there is an automatic and seamless fall-back to the old algorithm.
Another innovation is support for creating linearized PDF files, which are optimized for efficient web display. The resulting PDF adheres to Appendix F (Linearized PDF) of the PDF Reference. After creating the PDF in the usual way (any PDF flavor), the CSDK reorders the file contents and adds hint tables. As a result, the first page of the PDF loads quickly into a web page, and the remaining pages load while it is being viewed. Browsers can determine which page elements to present first (typically headings and text) and which can follow (heavier pictures and so on). Linearization also makes jumping to new pages in the PDF document more efficient.
Settings relating to the creation of linearized PDF are: Kernel.Imf.PDF.Linearized, Kernel.DTxt.PDF.Linearized and Converters.Text.PDF*.Linearized.
Linearized file creation also works with Asian-language PDF files.
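One way to check from outside the SDK whether an output file was written in linearized form is to look for the linearization parameter dictionary, which the PDF specification requires to sit within the first 1024 bytes of the file. The sketch below is a heuristic written for illustration, not a full validator; `looks_linearized` is a hypothetical helper and it does not verify the hint tables.

```python
import re

# Matches the /Linearized key of the linearization parameter dictionary.
_LINEARIZED_KEY = re.compile(rb"/Linearized\s+\d")


def looks_linearized(data):
    """Heuristic check on the leading bytes of a PDF file."""
    head = data[:1024]
    if not head.startswith(b"%PDF-"):
        return False
    return _LINEARIZED_KEY.search(head) is not None
```

In practice the function would be fed the first kilobyte of the saved file, for example the result of `open(path, "rb").read(1024)`.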
Support is introduced for PDF versions 1.6 and 1.7, including support for AES encryption (128 and 256 bit). File opening is handled through existing mechanisms, while new saving options are provided for applying AES encryption to files.
In response to client requests, it is now possible to preserve the original orientation when outputting to PDF Searchable (image-on-text) files. To do this, the setting Kernel.Img.KeepOriginalImage or the corresponding function kRecSetPreserveOriginalImg must be used for each page involved. If these images are kept, the orientation of PDF Searchable output pages remains the same as that of the input, overriding any auto-rotation decisions made during preprocessing.
This part of the SDK is an extension to KernelAPI and RecAPIPlus. It manages PDF files at the page level: it can copy, move or delete pages of PDF files, extract information from them, or change their pages. RecPDF is a mostly operation-based API. Page-level modifications are passed to an operation, and at the end the operation executes all of the changes at once. An operation can also be cancelled if it turns out that no modification is needed. For more information, see the documentation of the RecPDF Module.