Data capture, character recognition OCR ICR OMR CHR BCR, image processing, forms data capture, document indexing, automatic data extraction
OCR - Optical Character Recognition

OCR (Optical Character Recognition), used to recognize printed or typewritten text, is probably the best known of the technologies used in data capture.

OCR systems can be divided into monofont, multifont and omnifont.

Monofont OCR systems recognize a single font family, but with the highest possible accuracy. They are used, for example, to read data printed in OCR A and OCR B, fonts designed specifically for data capture, or data printed in CMC7 and E13B, fonts designed to be read magnetically and therefore optically recognizable only by dedicated systems.

Example of the OCR A font, which requires a dedicated OCR engine.

Omnifont OCR systems can, in theory, recognize any font family, regardless of the typeface, size and attributes (bold, italic, underline) of the printed text.

Multifont OCR systems sit between the two: they recognize characters printed in more than one font, but only if the font is among those the system expects.

The work done by an OCR system can be divided into three steps: segmentation, feature extraction and classification.

The first step, segmentation, identifies the characters to be read. If the area to be read is large, or covers a whole page, a layout analysis phase may also be needed to identify columns, paragraphs, text lines and words.
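
As a rough illustration of segmentation, the sketch below splits a single binarized text line into characters wherever a column contains no ink (a vertical projection profile). This is only one simple approach, not necessarily the one used by any particular engine. The function name `segment_columns` and the `#`/`.` bitmap encoding are assumptions made for this example.

```python
def segment_columns(bitmap):
    """Split a binary text-line image into character column ranges.

    bitmap is a list of equal-length strings; '#' is ink, '.' is background.
    Returns a list of (start, end) column ranges, end exclusive.
    """
    width = len(bitmap[0])
    # Vertical projection: number of ink pixels in each column.
    profile = [sum(row[x] == "#" for row in bitmap) for x in range(width)]

    segments, start = [], None
    for x, ink in enumerate(profile):
        if ink and start is None:
            start = x                      # a character begins here
        elif not ink and start is not None:
            segments.append((start, x))    # a blank column closes it
            start = None
    if start is not None:                  # character touching the right edge
        segments.append((start, width))
    return segments


line = [
    ".##..#..",
    "#..#.#..",
    "#..#.#..",
    ".##..#..",
]
print(segment_columns(line))  # [(0, 4), (5, 6)]
```

A splitter this simple merges characters that touch and splits characters that are broken, which is why print and scan quality matter so much for this step.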

The second step extracts the features of each segmented character. Several methods exist: some are statistical (for example, histograms of pixel frequencies), others geometric (for example, the curvature and direction of strokes). In any case, the basic idea is that the features shared by all instances of the same character should be sufficient to identify it, and should tolerate the noise and distortion introduced by scanning.
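
To make the histogram idea concrete, here is a minimal sketch that turns a small glyph bitmap into a feature vector of normalized row and column ink counts. The function name `projection_features` and the two 4x4 sample glyphs are invented for illustration. Real systems typically combine many more features.

```python
def projection_features(glyph):
    """Return row and column ink counts, each normalised to the range 0..1.

    glyph is a list of equal-length strings; '#' is ink, '.' is background.
    """
    height, width = len(glyph), len(glyph[0])
    rows = [sum(c == "#" for c in row) / width for row in glyph]
    cols = [sum(row[x] == "#" for row in glyph) / height for x in range(width)]
    return rows + cols


zero = [".##.", "#..#", "#..#", ".##."]
one = ["..#.", ".##.", "..#.", "..#."]

print(projection_features(zero))  # eight values of 0.5
print(projection_features(one))   # [0.25, 0.5, 0.25, 0.25, 0.0, 0.25, 1.0, 0.0]
```

The two vectors differ clearly, which is what the later classification step relies on.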

The third step is classification: the extracted features are analyzed to determine the character from its shape, using prior knowledge in the form of prototypes of the character set to be recognized.
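
One simple way to classify against prototypes is a nearest-prototype rule: pick the character whose stored feature vector is closest to the one just extracted. The sketch below assumes Euclidean distance and a hypothetical two-entry `prototypes` table. It shows the principle only, and production classifiers are usually far more elaborate.

```python
import math


def classify(features, prototypes):
    """Return the label of the prototype closest to the feature vector."""
    return min(prototypes, key=lambda label: math.dist(features, prototypes[label]))


# Hypothetical prototypes: one feature vector per known character.
prototypes = {
    "0": [0.5, 0.5, 0.5, 0.5],
    "1": [0.0, 0.25, 1.0, 0.0],
}

noisy_one = [0.0, 0.25, 0.75, 0.25]   # a slightly distorted "1"
print(classify(noisy_one, prototypes))  # 1
```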

An example of OCR performed on banknote serial numbers: note that a reading confidence is also given for each character.

The result of this process is that each character is assigned the ASCII (or Unicode) code that represents it and, possibly, a reading confidence that indicates how sure the system is of having classified it correctly.
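
A per-character result of this kind might be represented as follows. The `CharResult` type, the 0.0 to 1.0 confidence scale and the 0.80 review threshold are all assumptions for this sketch, since confidence scales differ from engine to engine.

```python
from dataclasses import dataclass


@dataclass
class CharResult:
    char: str          # recognised character
    confidence: float  # assumed scale: 0.0 (no trust) to 1.0 (certain)

    @property
    def code_point(self):
        """Unicode code point of the recognised character."""
        return ord(self.char)


def needs_review(results, threshold=0.80):
    """Return the positions whose confidence falls below the threshold."""
    return [i for i, r in enumerate(results) if r.confidence < threshold]


serial = [CharResult("A", 0.99), CharResult("B", 0.62), CharResult("7", 0.95)]
print([r.code_point for r in serial])  # [65, 66, 55]
print(needs_review(serial))            # [1]
```

Positions flagged this way can be routed to a human operator instead of being accepted automatically.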

To minimize OCR errors, some systems allow you to restrict the set of characters you expect to find. For example, if the data to be read consists only of digits, the OCR will not confuse zero with the letter "O", or "2" with the letter "Z", and so on.
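
The effect of restricting the character set can be sketched like this, assuming a hypothetical engine that returns several scored candidates for each character position. The function `best_allowed` and the scores are invented for the example.

```python
def best_allowed(candidates, allowed):
    """Pick the highest-scoring candidate that belongs to the allowed set.

    candidates: list of (character, score) pairs from a hypothetical engine.
    Returns None when no candidate is in the allowed set.
    """
    permitted = [(ch, score) for ch, score in candidates if ch in allowed]
    if not permitted:
        return None
    return max(permitted, key=lambda pair: pair[1])[0]


DIGITS = set("0123456789")

# Candidates for a three-character numeric field.
field = [
    [("O", 0.51), ("0", 0.49)],   # letter O narrowly ahead of zero
    [("Z", 0.55), ("2", 0.45)],   # letter Z narrowly ahead of two
    [("7", 0.98)],
]
print("".join(best_allowed(c, DIGITS) for c in field))  # 027
```

Without the restriction the top-scoring reading would be "OZ7". With it, the letters are never considered.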

Although OCR technology is very advanced, results can sometimes fall below expectations because of poor print quality or poor scanning of the documents being processed. The main recognition problems arise where several characters touch one another, or where a single character is broken into several parts.

Poor-quality printing or scanning can impair an OCR system, because sequences of touching characters and fragmented single characters are difficult to recognize.

In special cases, to maximize reading performance, you can use voting: several recognition engines run simultaneously on the same data, and a technique called reconciliation of the results decides, character by character, which reading is the most reliable.

This approach can drastically reduce the number of interpretation errors, but it increases processing time, the complexity of the data capture system, and its cost.
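
A minimal sketch of voting, assuming simple majority rule as the reconciliation technique and assuming every engine returns a string of the same length, already aligned character by character. Real reconciliation schemes may also weigh each engine's confidence, and the readings below are invented.

```python
from collections import Counter


def vote(readings):
    """Reconcile equal-length readings from several engines by majority.

    Returns the reconciled string and the positions where engines disagreed.
    On a tie, the reading of the earliest engine in the list wins.
    """
    result, disputed = [], []
    for i, chars in enumerate(zip(*readings)):
        winner, votes = Counter(chars).most_common(1)[0]
        result.append(winner)
        if votes < len(chars):         # not unanimous
            disputed.append(i)
    return "".join(result), disputed


readings = ["AB1234", "A81234", "AB1Z34"]   # three hypothetical engines
print(vote(readings))  # ('AB1234', [1, 3])
```

Each engine makes a different mistake here, so the majority recovers the correct string, at the price of running three recognitions instead of one.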

Our products that implement OCR technology

For more information on OCR technology and on our solutions that implement it, you can e-mail us at informazioni@recogniform.it.