High Accuracy OCR Saving Operating Costs

Overview

This document shows how the Prime Recognition High Accuracy OCR engine can reduce the ongoing labor costs of OCR by 65% at a lower initial investment than conventional OCR.

OCR costs are highly variable based on the quality of images, labor rates, multiple shifts, etc. This analysis uses simplistic (and conservative) assumptions to make it easy to follow.  For most applications, this analysis will understate the savings of using Prime Recognition’s OCR engine.

For example, this analysis uses a base configuration that generates 65% fewer errors.  A “high end” version is available that can reduce errors by 82%.

Importance of OCR Cost Reduction

An image of a document, i.e., a piece of paper converted into pixels in computer memory, is worthless unless you also electronically capture information about the image’s content. This will allow later electronic retrieval.

Ideally, you would at least capture all the text that appears on the document. There is tremendous value in electronically capturing the text information of an image. This is evidenced by the fast growth of imaging systems in recent years for automated processing of insurance forms, medical claims, legal documents, and other types of data on paper.

Manual data entry is an accurate way to capture this data but very expensive because of the cost of labor. OCR is popular because it is usually significantly less expensive than manual data entry. However, OCR is less accurate than “triple key” data entry, even after OCR error correction.

Projects that do capture the full text of each page using OCR find that OCR error correction is typically 50-60% of the full imaging system’s cost! Because OCR is so expensive (and manual data entry is typically worse) the majority of imaging systems do not capture the full text of the page. Instead, OCR or manual data entry is used to capture one or several “indexes” for the page – keywords or phrases that they hope will allow them to retrieve this page in the future as needed. This is obviously not what users would prefer but unless their data is of very high value, e.g., medical claims processing, they cannot afford to perform full OCR today.

The calculations below show how Prime Recognition lowers the cost of OCR, both on the initial investment, and on the ongoing costs, while at the same time increasing the accuracy of the data going into the user’s application. Many more users will now be able to cost-effectively capture data from paper documents using OCR.

Conventional OCR

Assumptions

Average OCR accuracy rate is 98%.
Note: 40 characters out of 2,000 on a typical full text page will be wrong. This is a typical average error rate on “real world” documents in real production sites. Error rates are highly dependent on image quality.

OCR throughput is 420 characters/sec.
Note: Assumes a 700 MHz Pentium PC.

60% of OCR errors are marked as “suspicious” characters.
Note: “Suspicious” characters are reviewed by data entry clerks to find and correct OCR errors. Errors that are not marked as suspicious (40% of all errors) do not get reviewed and are included in the final output. Users must use logical checks in their mainframe, database, or other target application to find and reject the data that includes errors (if possible).

Number of correct characters marked as suspicious is 2.5 times the total number of OCR errors.
Note: The total number of suspicious characters marked is highly variable between OCR engines and is configurable. This number is an average across the top 5 OCR engines and represents the setting that finds the most errors but also marks the most correct characters.

Data entry clerk time:
  • 0.5 seconds per suspicious character that is a correct character.
  • 1.5 seconds per suspicious character that is an error.
Note: Most analysts quote a simpler number of 5 seconds per error. This number includes suspicious character processing and error correction. Prime Recognition’s number works out to 2.75 seconds per error (2.5 correct suspicious characters * 0.5 sec, plus 1.5 sec to correct the error itself).

Calculations

1 second of conventional OCR generates:

  • 420 characters
  • 8.4 errors (420 char/sec * 2% error rate, i.e., 98% accuracy rate)
  • 5.0 of the errors are marked as suspicious (8.4 errors * 60% marked)
  • 21.0 suspicious characters which are correct (8.4 errors * 2.5)
  • 7.50 seconds of error correction time (5.0 errors * 1.5 sec per error)
  • 10.5 seconds of checking suspicious characters (21.0 char * .5 sec/char)

Or in other words:

1 second of conventional OCR processing generates 18 seconds of editing time (7.50+10.5) and 3.4 errors (8.4-5.0) that get past manual error correction.
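The arithmetic above can be reproduced with a short script. This is only a sketch of the model described in the assumptions; the function name `editing_workload` and its parameter names are illustrative and do not come from any Prime Recognition product. It works with unrounded values, so results can differ slightly from the rounded figures quoted above (for example, 18.06 seconds rather than 18).

```python
def editing_workload(chars, error_rate, flagged_share, false_flags_per_error,
                     sec_per_false_flag=0.5, sec_per_flagged_error=1.5):
    """Editing workload created by OCR of `chars` characters (illustrative model)."""
    errors = chars * error_rate                    # wrong characters produced by OCR
    flagged_errors = errors * flagged_share        # errors marked as suspicious
    false_flags = errors * false_flags_per_error   # correct characters marked as suspicious
    correction_sec = flagged_errors * sec_per_flagged_error
    checking_sec = false_flags * sec_per_false_flag
    return {
        "errors": errors,
        "flagged_errors": flagged_errors,
        "false_flags": false_flags,
        "editing_sec": correction_sec + checking_sec,
        "residual_errors": errors - flagged_errors,  # errors that get past the clerks
    }


# One second of conventional OCR: 420 characters at 98% accuracy.
conventional = editing_workload(chars=420, error_rate=0.02,
                                flagged_share=0.60, false_flags_per_error=2.5)
for name, value in conventional.items():
    print(f"{name}: {value:.2f}")
```

Changing `error_rate` or `flagged_share` shows how sensitive the editing time is to image quality and to the engine's suspicious-character setting.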

Key Benefits

  • Lowers OCR errors by 65%
  • Lowers OCR “suspicious” characters marked for clerical labor review by 65%
  • Data has 65% fewer OCR errors AFTER manual error correction.

Key Cost

  • 3.3 times slower than conventional OCR

Prime Recognition High Accuracy OCR Calculations

Prime Recognition will take 3.3 seconds to produce 420 characters, and it will generate the following:

  • 3.0 errors (65% fewer errors)
  • 1.8 of the errors are marked as suspicious (60% of errors, the same ratio as conventional OCR but on a 65% smaller base)
  • 7.4 suspicious characters which are correct (65% fewer suspicious characters)
  • 2.70 seconds of error correction time (1.8 errors * 1.5 sec per error)
  • 3.70 seconds of checking suspicious characters (7.4 * 0.5 sec per char)

Or in other words:

For the same throughput (420 characters) the Prime Recognition High Accuracy OCR engine generates 6.4 seconds of editing time (2.70+3.70) and 1.2 errors (3.0-1.8) that get past manual error correction.
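As a cross-check, the comparison for 420 characters can be sketched in a few lines. The helper `profile` and the constant names are hypothetical, chosen for this illustration only. The script assumes, as the text does, that the 65% reduction applies equally to errors and to correct characters marked as suspicious. It keeps unrounded intermediate values, so it gives roughly 6.3 seconds of editing time where the rounded figures above give 6.4.

```python
# Assumptions taken from the text; names are illustrative.
ERRORS_CONVENTIONAL = 420 * 0.02   # 8.4 errors per 420 characters
ERROR_REDUCTION = 0.65             # base configuration: 65% fewer errors
FLAGGED_SHARE = 0.60               # share of errors marked as suspicious
FALSE_FLAGS_PER_ERROR = 2.5        # correct characters marked suspicious, per error
SEC_PER_FLAGGED_ERROR = 1.5
SEC_PER_FALSE_FLAG = 0.5


def profile(ocr_sec, errors):
    """Time and residual errors for 420 characters, given OCR time and error count."""
    flagged = errors * FLAGGED_SHARE
    false_flags = errors * FALSE_FLAGS_PER_ERROR
    editing = flagged * SEC_PER_FLAGGED_ERROR + false_flags * SEC_PER_FALSE_FLAG
    return {
        "ocr_sec": ocr_sec,
        "editing_sec": editing,
        "total_sec": ocr_sec + editing,
        "residual_errors": errors - flagged,
    }


conv = profile(ocr_sec=1.0, errors=ERRORS_CONVENTIONAL)
prime = profile(ocr_sec=3.3, errors=ERRORS_CONVENTIONAL * (1 - ERROR_REDUCTION))

print(f"{'':<18}{'conventional':>14}{'high accuracy':>15}")
for key in conv:
    print(f"{key:<18}{conv[key]:>14.2f}{prime[key]:>15.2f}")
```

The slower OCR step adds 2.3 seconds, but the editing step shrinks by far more, which is why total processing time falls.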

Summary

                                                    Conventional OCR   Prime Recognition OCR
OCR time                                            1.0 second         3.3 seconds
Error Correction time                               18.0 seconds       6.4 seconds
Total Processing time                               19.0 seconds       9.7 seconds
Errors left in data after manual error correction   3.4 errors         1.2 errors

Financial Calculations

Assumptions

Cost of PC is $5,000.
Note: This is simplistic because an OCR PC typically sits in a closet with no monitor or an inexpensive monitor/graphics adapter. An editing workstation, on the other hand, requires a large screen and sophisticated graphics adapter, plus chairs/desks/cubicles for the data entry clerk.

Cost of data entry clerk is $20.00 per hour.
Note: Direct hourly rate is $9.75 per hour and the remainder is the overhead costs of labor, including fringe benefits, sick time, vacation time, personal time, direct supervisory salaries, human resource and accounting overhead, cost of real estate per person, etc.

Cost of Prime Recognition software per PC is $14,940.
Note: This is a “loaded” version. Versions of PrimeOCR exist that cost $5,000.

Cost of Conventional OCR and Editing Workstation software per PC is $8,000.

Assume a system which requires the throughput of one conventional OCR package running on one PC.

Conventional OCR

Capital Costs

Item                   Cost       Basis
OCR PC                 $5,000     1 station * $5,000
OCR S/W                $8,000     1 station * $8,000
Error Correction PCs   $90,000    Ratio of error correction time to OCR time is 18 seconds to 1 second, therefore 18 stations * $5,000
Error Correction S/W   $144,000   18 stations * $8,000

Ongoing Costs (per year)

Item                   Cost       Basis
Data Entry Clerks      $691,200   18 clerks * $20/hour * 8 hours/day * 240 work days per year

Prime Recognition High Accuracy OCR

Capital Costs

Item                   Cost       Basis
OCR PC                 $16,500    3.3 stations * $5,000
OCR S/W                $49,300    3.3 stations * $14,940
Error Correction PCs   $32,000    Ratio of error correction time to OCR time is 6.4 seconds to 3.3 seconds, therefore 6.4 stations * $5,000
Error Correction S/W   $51,200    6.4 stations * $8,000

Ongoing Costs (per year)

Item                   Cost       Basis
Data Entry Clerks      $245,800   6.4 clerks * $20/hour * 8 hours/day * 240 work days per year

Summary

                              Conventional OCR   Prime Recognition OCR
Capital Costs                 $247,000           $149,000
Ongoing Costs                 $691,200           $245,800
Total Processing time         19.0 seconds       9.7 seconds
Cost of errors left in data   Not Quantified     Not Quantified
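The cost figures follow directly from the station counts and the pricing assumptions, and can be recomputed with a short script. This is a sketch only: the function `costs` and the constant names are made up for this illustration, and it assumes one clerk per error correction station working a single shift, as the tables do. It multiplies unrounded values, so its totals can differ slightly from the rounded figures in the tables.

```python
PC_COST = 5_000          # per station
EDIT_SW_COST = 8_000     # conventional OCR and editing workstation software, per PC
PRIME_SW_COST = 14_940   # "loaded" Prime Recognition software, per PC
CLERK_RATE = 20.00       # fully loaded cost per clerk hour
HOURS_PER_YEAR = 8 * 240 # 8 hours/day * 240 work days


def costs(ocr_stations, ocr_sw_cost, edit_stations):
    """Return (capital cost, annual labor cost) for one configuration."""
    capital = (ocr_stations * (PC_COST + ocr_sw_cost)
               + edit_stations * (PC_COST + EDIT_SW_COST))
    ongoing = edit_stations * CLERK_RATE * HOURS_PER_YEAR  # one clerk per edit station
    return capital, ongoing


# Station counts come from the timing ratios: 1 : 18 and 3.3 : 6.4.
conventional = costs(ocr_stations=1, ocr_sw_cost=EDIT_SW_COST, edit_stations=18)
prime = costs(ocr_stations=3.3, ocr_sw_cost=PRIME_SW_COST, edit_stations=6.4)

for label, (capital, ongoing) in [("Conventional OCR", conventional),
                                  ("Prime Recognition OCR", prime)]:
    print(f"{label:<22} capital ${capital:>10,.0f}   ongoing ${ongoing:>10,.0f}/year")

labor_saving = 1 - prime[1] / conventional[1]
print(f"Annual labor saving: {labor_saving:.0%}")
```

Substituting local labor rates, shift patterns, or the lower-priced software version into the constants gives a site-specific estimate.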

Conclusions

1. Prime Recognition’s OCR engines create lower capital costs. You must look beyond the cost of the OCR engines and include the costs of manual error correction workstations.

2. Prime Recognition’s OCR engines create dramatically lower labor costs on an ongoing basis. Prime Recognition’s cost advantage will increase over time as the cost of PC power decreases by 25% per year and labor costs increase.

3. Prime Recognition generates 65% fewer errors that get past manual error correction (up to 82% with added options). This saves costs in applications that are sensitive to errors in data. For example, some applications use mainframe, database, or other application logic to reject data with errors. These rejects then must be manually checked to see if the errors were caused by OCR or some other source, and fixed.

4. The analysis above assumes that users want accurate data, and hence they use manual error correction to clean up data. This assumption applies to most applications. Some applications are purportedly not sensitive to errors in the data, e.g., full text searches with the new “fuzzy” search engines, so users are contemplating using OCR but without manual error correction. However, even fuzzy searches assume a significant level of accuracy in the data. If the error rate goes beyond that, perhaps on bad quality pages, those pages may not be found by electronic retrieval. Each user will have to decide how much risk they want to incur, e.g., is it OK if only 95% of the relevant documents show up in a search?
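To make the retrieval risk in point 4 concrete, a rough estimate of how often a search keyword survives OCR intact can be sketched as follows. This is a deliberately simple model and rests on an assumption not made in the analysis above: character errors are treated as independent and evenly spread, whereas real errors tend to cluster on bad quality pages. The function name `p_word_intact` is illustrative.

```python
def p_word_intact(char_accuracy, word_length):
    """Probability that every character of a word is recognized correctly,
    assuming independent character errors (a simplification)."""
    return char_accuracy ** word_length


CONVENTIONAL_ACCURACY = 0.98                      # 2% raw error rate
HIGH_ACCURACY = 1 - 0.02 * (1 - 0.65)             # 65% fewer errors -> 0.7% error rate

for length in (5, 8, 12):
    conv = p_word_intact(CONVENTIONAL_ACCURACY, length)
    high = p_word_intact(HIGH_ACCURACY, length)
    print(f"{length:>2}-character keyword: conventional {conv:.1%}, high accuracy {high:.1%}")
```

An exact-match search would miss the keyword whenever any of its characters is wrong; a fuzzy search tolerates some of these misses, but only up to a point.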

Prime Recognition offers a lower cost product (as low as $1,300) for applications that do not want to manually correct OCR errors. This engine is lower cost because it does not need to generate the information required by error correction software, such as character confidence levels, and suspicious character image “bounding boxes”.