RecAPI
Data collection from filled forms with CSV output. Form processing is supported on: Windows, Linux, Mac OS X.
Functions
RECERR RECAPIPLS | RecProcessFormPagesPDF (int sid, LPCTSTR sampleFormFile, LPCTSTR pageRange, LPCTSTR *inputFormFiles, LPCTSTR outoutDTXTFile) |
| Collects data from input form files. |
RECERR RECAPIPLS | RecProcessFormPagesTemplate (int sid, LPCTSTR *inputFormTemplFiles, LPCTSTR *inputFormFiles, LPCTSTR outoutDTXTFile) |
| Collects data from input form files. |
The Toolkit already provides form handling capabilities, using Logical Form Recognition® Technologies from Nuance. This allows form templates to be created and/or designed, so that sets of forms can be processed.
A second type of form handling, Form Data Extraction (FDE), has been available since CSDK 16. It is a simplified, more direct workflow for extracting and collating form data. The following table compares the two offerings:
Item | LFR | FDE |
Template source | Any supported image file type or by scanning a paper form. PDF files as input are treated as image-only. Only one page can be processed at a time. The form must be unfilled. | Must be an active PDF form - single or multi-page, filled or unfilled. |
Template page range | If used, must specify a single page. | Can specify any number of the existing pages; but must harmonize with the forms to be processed (see below). |
Template design | Form objects can be auto-detected and/or manually added, deleted or modified. | No changes are permitted; the specified template file must already be suitable for the task. |
Form field types | Check boxes, circle texts, comb fields, tables/cells, graphics, lines and text boxes. | Check boxes, text boxes, option (radio) buttons. |
Form field names | Can be set with kRecSetFormFieldName | Must be pre-defined as meta-data in the PDF template form. |
Anchors | Four pre-defined form controls set in template as anchors, must appear on all filled forms. | Automatic. Four text strings identified from fixed text on template, searched on all forms being processed. |
Forms being processed | | |
Multi-page forms | Can be handled, but a separate template is needed for each page; the application and end user must ensure template/form matching. | Can be handled – the page range for the template must be in harmony with the number of pages in forms being processed (see below). |
OCR usage during processing | Used only for data extraction, and only as necessary. | Used twice, to find anchor points and for data extraction - only when a usable text layer is not detected. |
Resizing tolerance | 10% (was 1.5% previously) | 10% or more. |
Recognition restrictions | Regular expressions or conditions like ‘Numbers only’ can be set for each form control. | No restriction of field data. |
Output | | CSV text only: by default, data from all forms goes into one file; each form becomes a row and each field a column. |
Upside-down pages | Auto-orientation should be able to resolve such cases. | Auto-orientation is on by default and should correct such errors. |
Error handling | There are three levels: | The same three levels exist, but in general the program decides which to apply, within the general workflow error handling system. Since FDE processing is more limited than LFR, there is a lower likelihood of serious errors arising. |
Neither FDE nor LFR is designed to handle Asian (including CCJK, Arabic, Thai and Hebrew) language forms.
To summarize, Form Data Extraction extracts data from sets of forms and collates it into a comma-separated text file (CSV). The file can be opened in database or spreadsheet programs, where each form is represented by a worksheet row and each detected form control becomes a worksheet column.
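The row-and-column layout described above can be consumed with ordinary CSV handling. The following is a minimal sketch of splitting one output row into its field values. It assumes conventional CSV quoting (fields optionally wrapped in double quotes, with a doubled quote standing for a literal quote); the exact quoting rules of the FDE output are not stated here, and the helper name `split_csv_line` and the sample values are hypothetical.

```c
#include <stddef.h>

/* Splits one CSV record in place and stores pointers to the fields.
   Assumption: conventional quoting, i.e. a field may be wrapped in double
   quotes, and "" inside a quoted field means one literal quote character.
   Stops at the end of the string, at a line break, or when max_fields
   pointers have been stored. Returns the number of fields stored. */
size_t split_csv_line(char *line, char **fields, size_t max_fields)
{
    size_t count = 0;
    char *src = line;

    while (count < max_fields) {
        char *start = src;   /* where the unescaped field begins */
        char *dst = src;     /* write position, never ahead of src */
        int quoted = 0;
        char delim;

        if (*src == '"') {   /* opening quote of a quoted field */
            quoted = 1;
            src++;
        }
        while (*src != '\0') {
            if (quoted) {
                if (*src == '"') {
                    if (src[1] == '"') {      /* escaped quote */
                        *dst++ = '"';
                        src += 2;
                        continue;
                    }
                    quoted = 0;               /* closing quote */
                    src++;
                    continue;
                }
            } else if (*src == ',' || *src == '\n' || *src == '\r') {
                break;                        /* end of this field */
            }
            *dst++ = *src++;
        }

        delim = *src;        /* remember it: dst may overwrite this byte */
        *dst = '\0';
        fields[count++] = start;

        if (delim != ',')
            break;           /* end of record */
        src++;               /* skip the comma */
    }
    return count;
}
```

With a row such as `form1.pdf,"Smith, John",Yes`, the sketch yields three fields, and the comma inside the quoted value stays part of the second field. The first row of the file can be split the same way to obtain the form field names used as column headers.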
A form template must be specified for each form type to be handled – in addition to a file name, a page range can be specified. This template file must be an active PDF form – it can be single- or multi-page, filled or unfilled. It must contain active tagged form controls – these can be text boxes, check boxes and option (radio) buttons. The form field names (labels) must be defined in the PDF template form; they will appear as the column headers in the target application. The PDF output converters can save such active PDF forms. This feature can be controlled through the settings PDFForms and PDFFormVisuality. Nuance PDF Converter Professional 5 can also generate such active PDF forms.
The forms to be processed must have a layout and content corresponding to the defined form template (page size, number of pages per form, location of controls, etc.). The forms can be:
A page range can be useful to exclude pages with form filling instructions or other unneeded content. It can be specified for the FDE template file and it must harmonize with the forms that are later processed, as shown in the following example for the page range 3-5:
Input | p1 and 2 | p3 to 5 | p6 and on |
Active PDF form template file | excluded | in range | excluded |
PDF/XPS files with text layers (*) | must exist | must exist | need not exist |
Scanned forms and image files (**) | no | yes | no |
(*) That means Normal or Searchable PDF or XPS files, including active PDF forms.
(**) In other words if the template defines a three-page form, each scanned filled form must contain three pages, each in the correct order. The same applies to image files, but the pages can be in three single-page files, one three-page file or any other combination (1+2 or 2+1). A set of forms can be presented in a single multi-page file, so long as each form contains three pages in the correct order. If a mismatch is detected, the program attempts to match the template to neighboring pages and may be able to continue processing.
Multiple page ranges are also acceptable, e.g. 3-5, 8, 11-14. In that case all PDF/XPS forms with a text layer must have all the pages corresponding to the template, up to and including the last validated template page.
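To make the page range notation concrete, here is a minimal sketch of a checker that decides whether a given page falls inside a range string such as "3-5, 8, 11-14". The function name `page_in_range` is hypothetical and not part of the toolkit. The sketch tolerates spaces around numbers and separators and rejects descending ranges such as "5-3"; both behaviors are assumptions, since the toolkit's own validation rules are not described here.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>

/* Skips blanks; tolerating them is an assumption of this sketch. */
static const char *skip_spaces(const char *p)
{
    while (*p == ' ')
        p++;
    return p;
}

/* Reads a positive decimal page number at *pp and advances *pp past it.
   Returns 0 if no digit is found or the number is not positive. */
static long parse_page(const char **pp)
{
    char *end;
    long value;

    if (!isdigit((unsigned char)**pp))
        return 0;
    value = strtol(*pp, &end, 10);
    *pp = end;
    return value > 0 ? value : 0;
}

/* Returns 1 if page is selected by range, 0 if it is not,
   and -1 if the range string is malformed.
   A NULL range selects every page. */
int page_in_range(const char *range, long page)
{
    const char *p = range;
    int found = 0;

    if (range == NULL)
        return 1;

    for (;;) {
        long first, last;

        p = skip_spaces(p);
        first = parse_page(&p);
        if (first == 0)
            return -1;
        last = first;

        p = skip_spaces(p);
        if (*p == '-') {                 /* "first-last" form */
            p = skip_spaces(p + 1);
            last = parse_page(&p);
            if (last < first)            /* also catches a missing number */
                return -1;
            p = skip_spaces(p);
        }

        if (page >= first && page <= last)
            found = 1;

        if (*p == '\0')
            return found;
        if (*p != ',')
            return -1;
        p++;                             /* continue after the comma */
    }
}
```

For the range "3-5, 8, 11-14", the sketch reports pages 4, 8 and 12 as selected and page 9 as not selected. A check of this kind can help an application confirm, before processing, that the pages it expects to be in range match the layout of the forms it is about to submit.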
For FDE the first task is to prepare a suitable template to be selected in step two of the following procedure. FDE processing is performed through workflows; only three steps are allowed:
RECERR RECAPIPLS RecProcessFormPagesPDF (int sid, LPCTSTR sampleFormFile, LPCTSTR pageRange, LPCTSTR *inputFormFiles, LPCTSTR outoutDTXTFile)
It collects data from input form files.
This function collects data from a set of filled forms for further processing in databases or spreadsheets. The layout and location of form elements are defined by a sample form file, which is used to generate form templates. The forms to be processed must be filled in by computer or a similar machine, not by hand. The output is a CSV file.
The sample file must be an active, non-image-only PDF form containing suitable AcroForm controls for form objects. It can be either filled or unfilled. It can be a multi-page form, and a page range can be specified to exclude non-form pages such as filling instructions.
Parameters
[in] | sid | Setting Collection. |
[in] | sampleFormFile | Name of the sample form file. |
[in] | pageRange | Page range specifying which pages should be processed. This is a string that contains a decimal number (e.g. "4"), two numbers separated by a '-' sign (e.g. "3-5"), or any comma-separated combination of these (e.g. "3-5,8,11-14"). It may be NULL, which selects all pages. |
[in] | inputFormFiles | Array of pointers to the names of the form files to be processed. The last pointer must be NULL. |
[in] | outoutDTXTFile | Name of the output CSV file. |
Return values
ZONE_SIZE_WARN | At least one zone was truncated, because it extends beyond the image. |
ZONE_SIZE_ERR | At least one zone was not loaded, because it extends beyond the image. |
IMG_ANCHOR_WARN | Some of the anchors were not found. |
IMG_ANCHOR_NOT_FOUND | No anchor was found or the sample form does not contain anchor zones. |
REC_OK | Successful. |
RECERR RecProcessFormPagesPDF(int sid, string sampleFormFile, string pageRange, string[] inputFormFiles, string outputDTXTFile);
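The inputFormFiles argument is an array of file name pointers that ends with a NULL entry. The following minimal sketch shows that convention without depending on the toolkit headers: it builds such a list and walks it to the terminator. The helper `count_form_files` and all file names are hypothetical, and plain `const char *` is used on the assumption of a narrow-character build, where LPCTSTR would map to that type.

```c
#include <stddef.h>

/* Counts the entries of a NULL-terminated array of file names.
   Returns 0 for a NULL array or for an array whose first entry is NULL. */
size_t count_form_files(const char *const *files)
{
    size_t n = 0;

    if (files == NULL)
        return 0;
    while (files[n] != NULL)
        n++;
    return n;
}

/* Hypothetical input list; the file names are placeholders. */
static const char *const example_forms[] = {
    "filled_001.pdf",
    "filled_002.pdf",
    "scan_003.tif",
    NULL    /* terminator: the last pointer must be NULL */
};

/* With the toolkit available, a call would then look roughly like this
   (not compiled here; sid, the template name, the page range and the
   output name are placeholders, and the array would need to be declared
   with the pointer type the prototype expects):

   RECERR rc = RecProcessFormPagesPDF(sid, "template.pdf", "3-5",
                                      forms, "collected.csv");
   if (rc != REC_OK) { ... inspect the return values listed above ... }
*/
```

Counting the entries before the call is an inexpensive way to catch a missing terminator during development, because a list without the final NULL would make the count run past the end of the array.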
RECERR RECAPIPLS RecProcessFormPagesTemplate (int sid, LPCTSTR *inputFormTemplFiles, LPCTSTR *inputFormFiles, LPCTSTR outoutDTXTFile)
It collects data from input form files.
This function collects data from a set of filled forms for further processing in databases or spreadsheets. The layout and location of form elements are defined by form template files. The forms to be processed must be filled in by computer or a similar machine, not by hand. The output is a CSV file.
Parameters
[in] | sid | Setting Collection. |
[in] | inputFormTemplFiles | Array of pointers to the names of the form template files. The last pointer must be NULL. |
[in] | inputFormFiles | Array of pointers to the names of the form files to be processed. The last pointer must be NULL. |
[in] | outoutDTXTFile | Name of the output CSV file. |
Return values
ZONE_SIZE_WARN | At least one zone was truncated, because it extends beyond the image. |
ZONE_SIZE_ERR | At least one zone was not loaded, because it extends beyond the image. |
IMG_ANCHOR_WARN | Some of the anchors were not found. |
IMG_ANCHOR_NOT_FOUND | No anchor was found or the template does not contain anchor zones. |
REC_OK | Successful. |
RECERR RecProcessFormPagesTemplate(int sid, string[] inputFormTemplFiles, string[] inputFormFiles, string outputDTXTFile);