Data capture, character recognition OCR ICR OMR CHR BCR, image processing, forms data capture, document indexing, automatic data extraction
FreeForm

The technology we call FreeForm represents the new frontier of data capture: extracting data from free-structure (unstructured) documents.

Technologies for capturing data from structured forms have been available for several years and have reached a very high degree of maturity, laying the groundwork for new challenges. Data capture from free-structure documents is one of them.

A structured document is any type of form in which the positions of the data to be extracted are precise and known in advance. An unstructured document, by contrast, still contains well-defined data, but the position and layout of those data are not known a priori and can vary greatly from one document to another of the same type.

The classic example of an unstructured document, one that most people come across daily, is the invoice. We know a priori that each invoice carries the supplier's business name, the date, the progressive invoice number, the taxable amount, the VAT and the total, but we cannot know in advance where these data are located. Their position is not standardized: it is left to each supplier, who can choose fonts, graphic elements, colors and shading as they see fit.

One possible strategy for dealing with these documents is to reduce them, where possible, to the case of homogeneous structured documents. Staying with invoices, you might create a specific template for each vendor's invoices, so that once the supplier has been identified, the invoice can be processed with the appropriate template.

This approach works well when the number of classes is small and when classification can be done accurately, whether performed automatically by software or manually by an operator. It does mean that the various templates must be prepared and maintained, and that each template is applied only to the documents that belong to it.

For other types of unstructured documents, such as curricula vitae, this strategy is not applicable.

The approach used to solve this problem starts from a logical definition of the data rather than a spatial one. In practice, the data to be read are defined, and then located, by a set of specific attributes: for example, nearby keywords, the expected format, relative position, the presence or absence of graphical elements, cross-validation checks, and so on.
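As an illustration only, a logical field definition of this kind could be written down as a small data structure. The sketch below is in Python and is not the FreeForm product's actual configuration format: the class name `FieldRule`, its attributes and the sample date rule are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class FieldRule:
    """Hypothetical logical definition of one field to extract."""
    name: str
    keywords: List[str]            # words expected near the value
    pattern: str                   # expected format, as a regular expression
    # Search area as page fractions (x0, y0, x1, y1); default is the whole page.
    region: Tuple[float, float, float, float] = (0.0, 0.0, 1.0, 1.0)
    validator: Optional[Callable[[str], bool]] = None  # optional extra check

    def accepts(self, text: str) -> bool:
        """True if the text has the expected format and passes the extra check."""
        if re.fullmatch(self.pattern, text) is None:
            return False
        return self.validator is None or self.validator(text)


# Sample rule (an assumption): a date written as DD/MM/YYYY near the word "DATE".
date_rule = FieldRule(
    name="date",
    keywords=["DATE"],
    pattern=r"\d{2}/\d{2}/\d{4}",
)

print(date_rule.accepts("31/12/2023"))
print(date_rule.accepts("2023-12-31"))
```

The point of the sketch is that nothing in the rule says where on the page the value is; the rule only describes what the value looks like and what surrounds it.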

In the case of the VAT number on an invoice, for example, the system can recognize it and obtain its value if it is instructed to find a sequence of 11 numeric characters (or 2 letters followed by 11 numeric characters) near (above, below, to the right or to the left of) the word "VAT", perhaps restricted to a certain area of the document (for example the top half of the image), verifying the checksum and, where one is available, checking the value against a database of suppliers.
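The VAT rule just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the product's implementation: the `Token` structure stands in for whatever the OCR engine returns, the distance and area thresholds are arbitrary, the checksum is assumed to be the Luhn check used for Italian VAT numbers, and the sample numbers are synthetic. The supplier database lookup is left out.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    """Hypothetical OCR word with its centre as page fractions (y = 0 is the top)."""
    text: str
    x: float
    y: float


# 11 digits, optionally preceded by a 2-letter country prefix.
VAT_PATTERN = re.compile(r"(?:[A-Z]{2})?(\d{11})")


def luhn_ok(digits: str) -> bool:
    """Luhn check: double every second digit from the right, sum, test mod 10."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_vat(tokens: List[Token], max_dist: float = 0.25,
             max_y: float = 0.5) -> Optional[str]:
    """Return the valid 11-digit number closest to a "VAT" keyword, if any."""
    anchors = [t for t in tokens if t.text.upper().strip(":.") == "VAT"]
    best, best_dist = None, max_dist
    for t in tokens:
        m = VAT_PATTERN.fullmatch(t.text)
        # Skip wrong formats, values outside the top area, and bad checksums.
        if m is None or t.y > max_y or not luhn_ok(m.group(1)):
            continue
        for a in anchors:
            # Manhattan distance covers above, below, left and right.
            dist = abs(t.x - a.x) + abs(t.y - a.y)
            if dist <= best_dist:
                best, best_dist = m.group(1), dist
    return best


# Synthetic sample page: one valid candidate, one with a wrong check digit,
# and one valid number that lies in the bottom half of the page.
sample = [
    Token("VAT:", 0.10, 0.10),
    Token("IT12345678903", 0.25, 0.10),
    Token("12345678904", 0.12, 0.10),
    Token("12345678903", 0.80, 0.90),
]
print(find_vat(sample))  # 12345678903
```

Note that the candidate closest to the keyword is rejected because its checksum fails: the logical conditions work together, and no single one of them decides the result.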

In practice, the software is instructed to "think" like humans do. When we look for the document total on an invoice, we naturally look toward the bottom right of the sheet, perhaps focus on a particularly prominent or highlighted box, and look for "clue" words such as "TOTAL DOCUMENT", "INVOICE AMOUNT" or "TOT. INVOICE". A system for processing unstructured documents acts in the same way: it relies on the knowledge we give it, in the form of properly configured rules, which must therefore be defined precisely and exhaustively.

Example of unstructured document recognition (Invoices): notice how the same 'date' field has been identified in completely different positions


These features are based on optical character recognition (OCR) of the entire document combined with a robust layout analysis algorithm. Used together, these two tools make it possible to identify text blocks, vertical and horizontal lines, and text elements with their confidence values, and to verify whether the logical conditions imposed on the data being searched for are satisfied on the page.

To make the processing of unstructured documents even more accurate, the two strategies described above can be combined: if the system can associate the document with a known template, the document is treated as a structured document; otherwise it is treated as an unstructured document and processed all the same.
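The combined strategy amounts to a simple dispatch, sketched here in Python. All names are hypothetical: `classify`, `templates` and `freeform` stand in for a real classifier, a set of per-supplier template extractors and a rule-based free-form extractor, and the toy versions below exist only so the sketch runs.

```python
from typing import Callable, Dict, Optional, Tuple

Fields = Dict[str, str]
Extractor = Callable[[str], Fields]


def process_document(
    ocr_text: str,
    classify: Callable[[str], Optional[str]],
    templates: Dict[str, Extractor],
    freeform: Extractor,
) -> Tuple[str, Fields]:
    """Use the supplier's template when one is known, else the free-form rules."""
    supplier = classify(ocr_text)
    extractor = templates.get(supplier) if supplier is not None else None
    if extractor is not None:
        return "template", extractor(ocr_text)
    return "freeform", freeform(ocr_text)


# Toy stand-ins (assumptions); real components would be far richer.
def toy_classify(text: str) -> Optional[str]:
    return "ACME" if "ACME" in text else None


toy_templates: Dict[str, Extractor] = {"ACME": lambda text: {"supplier": "ACME"}}


def toy_freeform(text: str) -> Fields:
    return {"supplier": "unknown"}


print(process_document("ACME S.p.A. invoice 42", toy_classify, toy_templates, toy_freeform))
print(process_document("Other Ltd invoice 7", toy_classify, toy_templates, toy_freeform))
```

Falling back to the free-form path also covers the case where the classifier names a supplier for which no template has been prepared yet.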

FreeForm data capture thus makes it possible to extract data from any type of document.

For more information on FreeForm technology and on our solutions that implement it, you can send an e-mail to informazioni@recogniform.it or fill in the form below.


Company
Title
First Name
Last Name
Address
Zip Code
State
Country
Phone
Fax
E-mail
Message

Having read the personal data policy (D. Lgs. 30 June 2003 n. 196 and subsequent amendments and additions), by clicking the "OK" button I consent to the collection, storage, processing, communication and, where appropriate, discontinuation of the processing of personal data concerning me, for the purposes specified in the policy.