PDF Files in Nuance OmniPage Capture SDK 20

Nuance OmniPage Capture SDK 20 provides extensive support for Portable Document Format files on both the input and output sides. This datasheet gives an overview of this area.

Both PDF input and output are supported on Windows, Linux and Mac OS X. In addition, PDF output is supported on Embedded Linux and Android. The same applies to PDF_MRC.

PDF Input is supplied in both the Professional Recognition Kit and the OCR Kit. The PDF Output Kit, however, is an optional add-on. For more details see the topic on Licensing in the General Information help system.

PDF File Format Summary

Name Adobe Portable Document Format
Format ID FF_PDF_*
Image load (read) Yes
Image save (write) Yes
Image types supported The image loading process creates a B/W, an 8-bit grayscale, a paletted or a 24-bit true-color image in the Engine.
Multi-page supported Yes
Special note Supports standard PDF files compliant up to the PDF v2.0 specification.

PDF Input in CSDK 20

PDF input is supported on: Windows, Linux and Mac OS X.

If you develop a Linux application using CSDK, please see also the Linux specific notes about PDF input.

By default (this can be changed using different settings), the program handles input PDF files as follows:

  1. Bitmap creation
    A bitmap is created from the loaded PDF.
  2. Information extraction
    After this, additional information is extracted from the PDF, including the following:
    1. Information on fonts and the decision whether font substitution is necessary
    2. Information on text with the exact position of its letters
    3. TAG information, if any.
  3. Pre-processing
    The next step is the pre-processing. This step may vary depending on whether the PDF contains textual information or not.
    1. If the PDF does not contain text information (it is an image-only PDF), all pre-processing (deskew, auto-rotation, binarization) and other operations run as they do for other image files (TIFF, JPEG, etc.).
      NOTE: Image-on-text PDFs do not undergo text extraction: during processing they are treated as image-only PDFs.
    2. If the PDF does contain text information, a specific binarization (developed for PDF files) runs, but without deskew and auto-rotation.
  4. Layout decomposition
    In PDF files that contain textual information, this text is extracted. The OCR engine runs on the image, but mainly to search for text areas and other elements on the page resulting in a zone set. Page layout and spacing are also determined. The generated and zoned bitmap (see Step 1) is collated with the extracted text to ensure its correct positioning on the page, including column placement and transfer of graphic elements.
  5. Recognition
    In PDF files with no accessible text layer, OCR runs to generate editable text and perform zoning. In PDF files with an accessible text layer, pages are zoned and word boundaries are determined as described above. Occasionally text extraction may be imperfect, so as a backup, recognition with two-way voting runs on the images and its result is compared to the text extracted from the PDF. If the differences are minor, recognized characters are corrected according to the ones in the PDF, since that text is more likely to be correct. If the differences are major, the recognition results are used as the final ones, since it is likely that identification of the PDF character encoding has failed (a non-standard encoding was not detected).
  6. Definition of character attributes
    Character attributes, such as size and style (bold, italic) can usually be defined using information extracted from the PDF. When PDF text is written in a type that is difficult to identify, a font attribute defining process will run.
  7. Other operations
    Character- and line spacing, paragraph, and table definitions are done just as in the case of image files.
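The comparison rule in step 5 can be illustrated with a minimal Python sketch. It is not the SDK's implementation: the function name reconcile, the whole-string similarity measure and the 0.9 threshold are all assumptions made for illustration, and the datasheet does not say how the SDK separates minor from major differences or that it works on whole strings rather than on individual characters.

```python
from difflib import SequenceMatcher

# Hypothetical cut-off between "minor" and "major" differences; the datasheet
# does not state how the SDK draws this line.
MINOR_DIFF_THRESHOLD = 0.9


def reconcile(pdf_text: str, ocr_text: str,
              threshold: float = MINOR_DIFF_THRESHOLD) -> str:
    """Choose the final text for a zone from the PDF text layer and the OCR result."""
    similarity = SequenceMatcher(None, pdf_text, ocr_text).ratio()
    if similarity >= threshold:
        # Minor differences: the PDF text layer is more likely to be correct.
        return pdf_text
    # Major differences: the PDF encoding was probably misidentified,
    # so the OCR result is kept.
    return ocr_text
```

With this sketch, a single misread character keeps the PDF text, while a text layer that shares nothing with the OCR result is discarded in favor of the OCR result.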

Resolution of the rendered bitmap

  • By default:
    • If the PDF does not have an image, resolution is determined by the value of the setting Kernel.Imf.DefaultDPI.
    • If the PDF has an image and there is only "little" text, the resolution of the rendered bitmap will be the maximal resolution of the images. However, this resolution is limited to 300 DPI.
  • If the setting Kernel.Imf.PDF.Resolution is not 0:
    • The given value will be the resolution.
  • If the setting Kernel.Imf.PDF.LoadOriginalDPI is TRUE:
    • If the PDF does not have an image, resolution is determined as in the default case.
    • If the PDF has an image, the resolution of the rendered bitmap will be the maximal resolution of the images. However, this resolution is limited to 600 DPI.
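The rules above can be summarized in a short Python sketch. The function and parameter names are hypothetical; only the setting names in the comments come from this datasheet. Two points are assumptions: that a non-zero Kernel.Imf.PDF.Resolution takes precedence over Kernel.Imf.PDF.LoadOriginalDPI, and that None is returned for the default case of an image with more than "little" text, which the datasheet does not describe.

```python
from typing import Optional


def rendered_dpi(default_dpi: int,
                 pdf_resolution: int,
                 load_original_dpi: bool,
                 max_image_dpi: Optional[int],
                 little_text: bool = True) -> Optional[int]:
    """Model of the documented resolution rules for the rendered bitmap.

    default_dpi       -- value of Kernel.Imf.DefaultDPI
    pdf_resolution    -- value of Kernel.Imf.PDF.Resolution (0 means "not set")
    load_original_dpi -- value of Kernel.Imf.PDF.LoadOriginalDPI
    max_image_dpi     -- highest resolution among the PDF's images, None if no image
    little_text       -- whether the page has only "little" text (default case only)
    """
    if pdf_resolution != 0:
        # Assumption: an explicit resolution wins over LoadOriginalDPI.
        return pdf_resolution
    if max_image_dpi is None:
        # No image: the default DPI is used in both remaining cases.
        return default_dpi
    if load_original_dpi:
        return min(max_image_dpi, 600)
    if little_text:
        return min(max_image_dpi, 300)
    return None  # case not covered by the datasheet
```

For example, a PDF whose largest image is 1200 DPI is rendered at 300 DPI by default and at 600 DPI when LoadOriginalDPI is TRUE.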

Handling Encrypted PDF Files

PDF files may be password-protected. There are two types of password: open (or user) and permissions (or owner, or master).

Open passwords can block file opening. When a PDF requires an open password, CSDK 20 cannot open the file without it. Your application must include an interface to accept a password.

As for permissions passwords, CSDK 20 only checks the permissions that block printing or content-copy from the file.

When a PDF requires a permissions password for content-copy, its text content cannot be copied without it. CSDK 20, however, lets you process a content-copy protected file without giving a permissions password. In this case the encrypted PDF is treated as an image-only one and no textual information can be extracted.

A PDF may also require a permissions password for printing. CSDK 20 will only load a PDF if its printing is not blocked – that is, the user either has this permissions password, or the file is not protected against printing.

When a PDF file is protected by both an open and a permissions password, only the permissions password needs to be given for full access.
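The access rules of this section can be condensed into a small decision function. This is a sketch of the documented behavior only: the function name, its parameters and the three result strings are hypothetical and are not part of the CSDK API.

```python
def pdf_access(needs_open_password: bool,
               copy_blocked: bool,
               print_blocked: bool,
               has_open_password: bool = False,
               has_permissions_password: bool = False) -> str:
    """Return "text", "image-only" or "refused" for an encrypted input PDF."""
    if has_permissions_password:
        # The permissions password alone gives full access,
        # even when an open password is also set.
        return "text"
    if needs_open_password and not has_open_password:
        return "refused"       # the file cannot be opened at all
    if print_blocked:
        return "refused"       # PDFs with blocked printing are not loaded
    if copy_blocked:
        return "image-only"    # processed, but no text can be extracted
    return "text"
```

A caller could use the "image-only" result to decide whether to prompt the user for a permissions password before falling back to OCR on the rendered page.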

PDF Output in CSDK 20

CSDK 20 is able to produce PDF files on the KernelAPI, RecAPIPlus as well as on the IPRO layers.

PDF Output in KernelAPI

PDF output in KernelAPI is supported on: Windows, Linux, Embedded Linux, Mac OS X.

For creating image-only PDF files, KernelAPI offers the following format choices:

FF_PDF_MIN Minimum image file size
FF_PDF_GOOD Medium image file size
FF_PDF_SUPERB Large image file size with high quality
FF_PDF_MRC_MIN MRC-compressed PDF optimized for minimum file size
FF_PDF_MRC_GOOD MRC-compressed PDF of medium file size
FF_PDF_MRC_SUPERB MRC-compressed PDF providing large file size, but high quality

In case of MRC formats the image is saved in multiple layers:

  • one background layer containing the graphics and the background behind the text,
  • one or more foreground layers containing the text,
  • optionally (depending on the current MRC format) one selector layer.

The background, foreground and selector layers are compressed using different compression algorithms.

When creating an MRC file, CSDK decomposes the image into multiple layers and sub-images. This process includes a text detection algorithm. It can benefit from the OCR result: no text detection is needed if the HPAGE contains an II_OCR image. Note that the II_OCR image is created during recognition and, by default, is freed after recognition. To keep it, the setting Kernel.OcrMgr.Images.KeepOcrImage must be set to TRUE before recognition.

For more about the PDF MRC format, see the section Saving MRC PDF files in KernelAPI below.

The option of creating image-only PDFs is available directly after image input.

The other PDF output available in KernelAPI is DirectTXT PDF (DTXT_PDFIOT). It contains the whole image of the original page and the text behind the image on a separate layer. These PDF files especially suit the purpose of page archiving, because they contain both the original image and recognized text.

It is recommended to set Kernel.OcrMgr.Images.KeepOcrImage to TRUE when creating a DirectTXT PDF, as mentioned previously.

DTXT PDF output also provides the different PDF quality levels. However, whereas image-only PDF has a separate format for each quality level, DTXT uses a single output format, and settings specify the quality level and determine whether MRC is used. See the settings of DTXT image-on-text PDF output.

PDF Output in RecAPIPlus and IPRO

PDF output at the RecAPIPlus level is supported on: Windows.

In RecAPIPlus and IPRO, the following PDF output formats and converters are available:

PDF with image on text (Searchable PDF in OmniPage terminology) – A PDF converter where the original (input) image is retained in the foreground with the recognized text hidden in the background (and in the correct position). This format allows the content of an image PDF to become searchable without disrupting the original due to the hidden text layer. Text in a Searchable PDF is positioned directly behind the corresponding image text and is selectable and searchable in popular PDF viewers. This format especially suits archiving and indexing purposes.

PDF - A highly configurable, general PDF output converter. It supports many PDF features, but relies heavily on the position of the recognized characters.

PDF with image substitutes - A special PDF converter, where the suspect words are covered by their image cut out from the original image.

PDF edited – This PDF converter does not rely on the position of the recognized characters, so it could be used even after inserting large new portions of text in the editor.

All PDF Output converters have the following features in common:

  • Compression options, including (for details see the Compression settings):
    • Content stream compression (flate)
    • JBIG2 compression for black/white images
      (available from PDF v1.4)
    • JPEG2000 compression for color images
      (available from PDF v1.5)
    • Compression of embedded font files
  • Appending the output to an existing PDF file
  • MRC compression of the image for even smaller size.
    MRC is short for Multi-Raster Content Technology. It segments images into layers and applies different compression algorithms to each layer, thus optimizing both file size and quality.
  • Creation of fillable PDF forms (with LFR – Logical Form Recognition)
  • Compatibility settings for PDF versions 1.0 - 1.7, 2.0
  • PDF/A compliant output (by modifying the setting Compatibility of the selected PDF converter)
    PDF/A is a normative ISO-compliant PDF file specification based on PDF 1.4 designed for two main purposes:
    • To generate PDF files that display and handle uniformly over the broadest possible range of operating systems, environments and PDF viewers or editors.
    • To generate PDF files that will remain viewable over a long period of time, so that archived material is protected against obsolescence due to technological innovation.
  • Predefined settings for highest quality or for smallest file size
  • Tagged PDF file creation based on our layout recognition
  • Font embedding
  • Selectable quality for image compression, image resolution and color depth
  • Outline tree creation for document and page thumbnails
  • Ability to exclude the text of headers and footers from the output
  • Digital signing of the created PDF files (see PDF converter settings Converters.Text.PDF*.Signature.*)
  • Security settings (extract content, modify content, print, etc) definition plus open and permissions password setting
  • Content encryption (40, 128 or 256 bit)
  • Highlighting of the recognized URLs in the text and/or turning them into clickable links

Saving new encrypted PDF files

CSDK can save encrypted PDF files using either an open (or user) or a permissions (or owner, or master) password (see also the description of encrypted PDF files above). This option is controlled entirely through settings. The setting Kernel.Imf.PDF.PDFSecurity.Type determines the encryption method used. The passwords can also be stored in settings (Kernel.Imf.PDF.PDFSecurity.OwnerPassword and UserPassword).

In addition, permission flags can be set for the created PDF file to enable:

  • modifying the document contents,
  • extracting text and graphics from the document to support accessibility for users with disabilities,
  • copying text and graphics,
  • adding or modifying text annotations,
  • printing the document,
  • filling in forms and signing the document,
  • assembling the document: inserting, rotating and deleting pages.

The created PDF file can be opened for the enabled operations by using the open password. If one has the permissions password, all the operations are enabled.
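As an illustration, the permissions listed above correspond to bits of the permission value defined for the standard security handler in the PDF specification. The Python sketch below uses those bit positions; the class and function names are hypothetical, and the datasheet does not show how the CSDK settings encode these flags, so this should not be read as the SDK's own representation.

```python
from enum import IntFlag


class Permission(IntFlag):
    """Permission bits as numbered in the PDF specification (bit 1 = lowest)."""
    PRINT = 1 << 2                      # bit 3: print the document
    MODIFY = 1 << 3                     # bit 4: modify the document contents
    COPY = 1 << 4                       # bit 5: copy text and graphics
    ANNOTATE = 1 << 5                   # bit 6: add or modify text annotations
    FILL_FORMS = 1 << 8                 # bit 9: fill in forms and sign
    EXTRACT_FOR_ACCESSIBILITY = 1 << 9  # bit 10: extract for accessibility
    ASSEMBLE = 1 << 10                  # bit 11: insert, rotate, delete pages


def allowed(granted: Permission, wanted: Permission,
            has_permissions_password: bool = False) -> bool:
    """True if the operation is enabled for a user who opened the file."""
    # The permissions password enables every operation.
    return has_permissions_password or (granted & wanted) == wanted
```

For example, a file saved with Permission.PRINT | Permission.COPY lets a user who has only the open password print and copy, but not assemble pages.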

Modifying an existing encrypted PDF file

Existing encrypted PDF files can be modified and saved as well. In this case the password of the existing file is needed to process it. The file and the given password determine which operations can be performed, so the setting Kernel.Imf.PDF.PDFSecurity.Type has no effect. The password (either the open or the permissions one) can be specified in the setting Kernel.Imf.PDF.PDFSecurity.ProcessPassword.

Saving MRC PDF files in KernelAPI

There is no MRC compression for black-and-white images. Thus, if an MRC format is selected for a B/W image, the corresponding non-MRC format is used instead.
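The fallback for B/W images can be pictured as a simple lookup. In the Python sketch below the format names are the FF_PDF_* identifiers from this datasheet, written as strings; the pairing by quality level and the function name are assumptions, since the datasheet does not spell out which non-MRC format replaces which MRC format.

```python
# Assumed pairing: each MRC format falls back to the non-MRC format
# of the same quality level.
NON_MRC_EQUIVALENT = {
    "FF_PDF_MRC_MIN": "FF_PDF_MIN",
    "FF_PDF_MRC_GOOD": "FF_PDF_GOOD",
    "FF_PDF_MRC_SUPERB": "FF_PDF_SUPERB",
}


def effective_format(requested: str, is_black_and_white: bool) -> str:
    """Return the format actually used when saving an image."""
    if is_black_and_white:
        # Non-MRC formats are not in the table and pass through unchanged.
        return NON_MRC_EQUIVALENT.get(requested, requested)
    return requested
```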

Both image-only and image-on-text MRC PDF files can be saved. The former are saved by the Image File Handling Module, the latter by the Direct TXT Output Converter Module. The description below uses the terms of the image-only case; the corresponding terms for the image-on-text case follow naturally. For more information see the section PDF Output in KernelAPI above.

DTXT image-on-text PDF output also provides the different PDF quality levels. However, whereas image-only PDF has a separate format for each quality level, DTXT uses a single output format, and settings specify the quality level and determine whether MRC is used. See the settings of DTXT image-on-text PDF output.

Compression methods:

Compression Quality:

                     Selector layer        Foreground layer      Background layer
                     Group4    JBIG2       Jpeg      Jpeg2000    Jpeg    Jpeg2000
FF_PDF_MRC_SUPERB    lossless  lossless    50%       50%         75%     75%
FF_PDF_MRC_GOOD      lossless  lossy       not used  not used    75%     75%
FF_PDF_MRC_MIN       lossless  lossy       not used  not used    50%     50%
Note:
The compression rate cannot be set for JBIG2 compression.
Kernel.IMF.PDF.MRC.JPGQualityBGMin and Kernel.IMF.PDF.MRC.JPGQualityBGGood settings can be used to control background quality in PDF_MRC_MIN and PDF_MRC_GOOD modes, respectively.
Kernel.IMF.PDF.MRC.JPGQualityBGMax and Kernel.IMF.PDF.MRC.JPGQualityFGMax settings can be used to control background and foreground quality in PDF_MRC_SUPERB mode.

Resolution of the layers:

The resolution of the Selector Layer is given relative to the Current Image; the resolutions of the Foreground and Background Layers are given relative to the Selector Layer.

FF_PDF_MRC_SUPERB
  Selector Layer: 1, 1.5 or 2 depending on the current Resolution Enhancement setting (see note 1. below)
  Foreground and Background Layers: same as in case of MRC_GOOD (see note 2. below), or 1 if resolution enhancement was not applied, otherwise 1/2 (see note 3. below)
FF_PDF_MRC_GOOD and FF_PDF_MRC_MIN
  Current Image <= 450 DPI: Foreground Layer 1/3, Background Layer 1/3
  Current Image > 450 DPI: 1/4
Note:
1. The II_BW image is used as Selector Layer. If the II_BW image does not exist it will be created using the current settings (see secondary image conversion).
2. Kernel.IMF.PDF.MRC.MaxEnableRR [default: true!] can be used to control application of resolution reduction in case of MRC_SUPERB similarly as in case of MRC_GOOD.
3. If resolution enhancement was not applied during the creation of the II_BW image, the resolution of all three layers will be the same as the resolution of the current (II_CURRENT) image.
Otherwise the resolution of the FG and BG layers will be half that of the Selector Layer. This means that if the resolution enhancement ratio is 2, the resolution of the FG and BG layers will be the same as the resolution of the current image. If the ratio is 1.5, the resolution of the FG and BG layers will be 3/4 of the resolution of the current image.
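The arithmetic in note 3 can be checked with a short Python sketch. It covers only the case that note describes for FF_PDF_MRC_SUPERB, that is, it assumes the resolution reduction controlled by Kernel.IMF.PDF.MRC.MaxEnableRR is not applied; the function name and the dictionary keys are hypothetical.

```python
def superb_layer_dpi(current_dpi: float, enhancement_ratio: float = 1.0) -> dict:
    """Layer resolutions for FF_PDF_MRC_SUPERB as described in note 3.

    enhancement_ratio is the resolution enhancement applied when the II_BW
    image was created: 1 (none), 1.5 or 2.
    """
    if enhancement_ratio not in (1.0, 1.5, 2.0):
        raise ValueError("enhancement ratio must be 1, 1.5 or 2")
    # The II_BW image serves as the selector layer (note 1).
    selector = current_dpi * enhancement_ratio
    if enhancement_ratio == 1.0:
        fg_bg = selector          # all three layers share the current resolution
    else:
        fg_bg = selector / 2      # FG and BG are half of the selector layer
    return {"selector": selector, "foreground": fg_bg, "background": fg_bg}
```

For a 300 DPI current image, a ratio of 2 gives a 600 DPI selector layer with 300 DPI foreground and background layers, and a ratio of 1.5 gives 450 DPI and 225 DPI (3/4 of 300).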

Saving MRC PDF files using the output converters (in RecAPIPlus)

When a document is saved in a PDF format, the image can be saved using MRC technology. In this case the same compression methods and compression quality are used, and the resolution of the layers is also the same as described previously.

The following converter settings affect the compression: Compatibility, UseMRC, Compression.UseJBIG2 and Compression.UseJPEG2000.

The following combinations can be used:

Compatibility                         UseMRC              Compression.UseJBIG2  Compression.UseJPEG2000
R2ID_PDF_FORCESIZE                    R2ID_PDFMRC_MIN     TRUE                  TRUE
R2ID_PDF_FORCEQUALITY                 R2ID_PDFMRC_NO
R2ID_PDF20                            any possible value  TRUE/FALSE            TRUE/FALSE
R2ID_PDF17                            any possible value  TRUE/FALSE            TRUE/FALSE
R2ID_PDF16                            any possible value  TRUE/FALSE            TRUE/FALSE
R2ID_PDF15                            any possible value  TRUE/FALSE            TRUE/FALSE
R2ID_PDF14                            any possible value  TRUE/FALSE            FALSE
R2ID_PDF13                            any possible value  FALSE                 FALSE
R2ID_PDF12 and below                  R2ID_PDFMRC_NO
R2ID_PDFA (deprecated)                any possible value  TRUE/FALSE            FALSE
R2ID_PDFA1B (instead of R2ID_PDFA)    any possible value  TRUE/FALSE            FALSE
R2ID_PDFA2B                           any possible value  TRUE/FALSE            TRUE/FALSE
R2ID_PDFA3B                           any possible value  TRUE/FALSE            TRUE/FALSE
R2ID_PDFA2U                           any possible value  TRUE/FALSE            TRUE/FALSE
R2ID_PDFA3U                           any possible value  TRUE/FALSE            TRUE/FALSE
R2ID_PDFA1A                           any possible value  TRUE/FALSE            FALSE
R2ID_PDFA2A                           any possible value  TRUE/FALSE            TRUE/FALSE
R2ID_PDFA3A                           any possible value  TRUE/FALSE            TRUE/FALSE
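An application could validate its converter settings against the table above before saving. The Python sketch below encodes a subset of the rows; the dictionary, the function name and the use of strings for the R2ID_* values are illustrative assumptions, and rows whose JBIG2 and JPEG2000 cells are empty in the table are deliberately left out.

```python
from typing import List

# (JBIG2 may be TRUE, JPEG2000 may be TRUE) per Compatibility value,
# copied from a subset of the rows in the table above.
ALLOWED = {
    "R2ID_PDF20": (True, True),
    "R2ID_PDF17": (True, True),
    "R2ID_PDF16": (True, True),
    "R2ID_PDF15": (True, True),
    "R2ID_PDF14": (True, False),
    "R2ID_PDF13": (False, False),
    "R2ID_PDFA1B": (True, False),
    "R2ID_PDFA2B": (True, True),
    "R2ID_PDFA3B": (True, True),
}


def compression_conflicts(compatibility: str, use_jbig2: bool,
                          use_jpeg2000: bool) -> List[str]:
    """Return the settings that must be FALSE for the given Compatibility."""
    jbig2_ok, jpeg2000_ok = ALLOWED[compatibility]
    conflicts = []
    if use_jbig2 and not jbig2_ok:
        conflicts.append("Compression.UseJBIG2")
    if use_jpeg2000 and not jpeg2000_ok:
        conflicts.append("Compression.UseJPEG2000")
    return conflicts
```

An empty result means the combination appears in the table; a non-empty result names the settings to switch off.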

The Pictures setting may modify the resolution of the layers. If the resolution of a layer is higher than the resolution specified by the Pictures setting, the layer is transformed to the specified resolution, so it is recommended to leave this setting at its default value (R2_DPI_ORIGINAL) when saving MRC PDF.

The PictureColor setting may change the bit depth of the layers, so it is recommended to leave it at its default value (R2_BPP_ORIGINAL).

Improvements after CSDK 15

Generating editable output from PDF files has been sped up: more advanced technology is applied to make zoning faster, achieve higher OCR accuracy, improve output quality and make the resulting files more usable when they are edited further in target applications. This is achieved by creating two images whenever the input is a PDF or XPS file with a text layer. One is a composite image with all PDF information; the other contains only a background image without any text. This is especially useful for pages where text wraps around pictures irregularly, as shown. Further speed-up is achieved by assessing image quality and layout complexity. A faster OCR algorithm is now applied to high-quality pages with simple layouts.

[Figure: wraparound.jpg, a page where text wraps irregularly around a picture]

This technology cannot be applied to active PDF forms, and works only in accurate mode. In cases where this technology cannot be applied, there is an automatic and seamless fall-back to the old algorithm.

Another innovation is support for creating linearized PDF files. These are optimized for efficient web display. The resulting PDF adheres to Appendix F (Linearized PDF) of the PDF Reference: after creating the PDF in the usual way (any PDF flavor), the CSDK reorders the file contents and adds hint tables. As a result, the first page of the PDF loads quickly into a web page, with the remaining pages loaded while it is being viewed. Browsers can determine which page elements to present first (typically headings and texts) and which can follow (heavier pictures etc.). Linearization also makes jumping to new pages in the PDF document more efficient.

Settings relating to the creation of linearized PDF are: Kernel.Imf.PDF.Linearized, Kernel.DTxt.PDF.Linearized and Converters.Text.PDF*.Linearized.

Linearized file creation works also with Asian-language PDF files.

Support is introduced for PDF version 1.6 and 1.7 – this includes support for the AES encryption system (128 and 256 bit). File opening is handled through existing mechanisms, while new saving options are provided for applying AES encryption to files.

Original orientation can be forced for PDF Searchable output

In response to client requests, it is now possible to have the original orientations conserved when outputting to PDF Searchable (Image-on-text) files. To implement this, the setting Kernel.Img.KeepOriginalImage or the corresponding function kRecSetPreserveOriginalImg must be used for each page involved. If these images are kept, the orientation on PDF Searchable output pages remains the same as that of the input, overriding any auto-rotation decisions that may have been performed during preprocessing.

RecPDF API for managing page-level manipulations of PDF files

This part of the SDK is an extension to KernelAPI and RecAPIPlus. It manages PDF files at the page level. It can copy, move or delete pages of PDF files. It can also extract information from them or change their pages. RecPDF is a mostly operation-based API: the page-level modifications are passed to an operation, and at the end the operation executes all of the changes at the same time. Operations can also be cancelled if it turns out that no modification is needed. For more information see the documentation of the RecPDF Module.
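The operation-based pattern described above can be pictured with a toy Python model. It is not the RecPDF API: the class PageOperation and its methods are hypothetical and work on a plain list of page labels, purely to show how modifications are collected first and then either executed together or cancelled.

```python
class PageOperation:
    """Toy model: collect page-level changes, then apply them in one step."""

    def __init__(self, pages):
        self._source = list(pages)   # the document is not touched until execute()
        self._pending = []

    def move(self, src: int, dst: int) -> None:
        self._pending.append(("move", src, dst))

    def delete(self, index: int) -> None:
        self._pending.append(("delete", index))

    def cancel(self) -> None:
        # Drop all queued changes; execute() then returns the pages unchanged.
        self._pending.clear()

    def execute(self) -> list:
        pages = list(self._source)
        # Changes are applied in the order they were queued; each index
        # refers to the page list as it stands at that point.
        for change in self._pending:
            if change[0] == "move":
                _, src, dst = change
                pages.insert(dst, pages.pop(src))
            else:
                del pages[change[1]]
        return pages
```

Queuing a move of page 0 to the end and then a deletion of index 1 on the pages A, B, C yields B, A; calling cancel() before execute() leaves A, B, C untouched.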