High Accuracy OCR Saving Operating Costs
Overview
This document shows how the Prime Recognition High Accuracy OCR engine can reduce the ongoing labor costs of OCR by 65% at a lower initial investment than conventional OCR.
OCR costs vary widely with image quality, labor rates, the number of shifts, etc. This analysis uses simplified (and conservative) assumptions to make it easy to follow. For most applications, this analysis will understate the savings of using Prime Recognition’s OCR engine.
For example, this analysis uses a base configuration that generates 65% fewer errors. A “high end” version is available that can reduce errors by 82%.
Importance of OCR Cost Reduction
An image of a document, i.e., a piece of paper converted into pixels in computer memory, is worthless unless you also electronically capture information about the image’s content, because that captured information is what allows the document to be retrieved electronically later.
Ideally, you would at least capture all the text that appears on the document. There is tremendous value in electronically capturing the text information of an image. This is evidenced by the fast growth of imaging systems in recent years for automated processing of insurance forms, medical claims, legal documents, and other types of data on paper.
Manual data entry is an accurate way to capture this data but very expensive because of the cost of labor. OCR is popular because it is usually significantly less expensive than manual data entry. However, OCR is less accurate than “triple key” data entry, even after OCR error correction.
Projects that do capture the full text of each page using OCR find that OCR error correction is typically 50-60% of the full imaging system’s cost! Because OCR is so expensive (and manual data entry is typically worse) the majority of imaging systems do not capture the full text of the page. Instead, OCR or manual data entry is used to capture one or several “indexes” for the page – keywords or phrases that they hope will allow them to retrieve this page in the future as needed. This is obviously not what users would prefer but unless their data is of very high value, e.g., medical claims processing, they cannot afford to perform full OCR today.
The calculations below show how Prime Recognition lowers the cost of OCR, both on the initial investment, and on the ongoing costs, while at the same time increasing the accuracy of the data going into the user’s application. Many more users will now be able to cost-effectively capture data from paper documents using OCR.
Conventional OCR
Assumptions | Notes |
---|---|
Average OCR accuracy rate is 98%. | 40 characters out of 2000 on a typical full text page will be wrong. This is a typical average error rate on “real world” documents in real production sites. Note that error rates are highly dependent on image quality. |
OCR throughput is 420 characters/sec. | Assumes a 700 MHz Pentium PC. |
60% of OCR errors are marked as “suspicious” characters. | “Suspicious” characters are reviewed by data entry clerks to find and correct OCR errors. Errors that are not marked as suspicious – 40% of all errors – do not get reviewed, and are included in the final output. Users must use logical checks in their mainframe, database, or other target application to find and reject the data that includes errors (if possible). |
Number of correct characters marked as suspicious is 2.5 times the total number of OCR errors. | The total number of suspicious characters marked is highly variable between OCR engines and is configurable. This number is an average across the top 5 OCR engines and represents the setting that finds the most errors but also marks the most correct characters. |
Data entry clerk time: 0.5 seconds per suspicious character that is a correct character; 1.5 seconds per suspicious character that is an error. | Most analysts quote a simpler number of 5 seconds per error. This number includes suspicious character processing and error correction. Prime Recognition’s number works out to 2.75 seconds per error (1.5 seconds to correct the error plus 2.5 correct suspicious characters at 0.5 seconds each). |
Calculations
1 second of conventional OCR generates:
- 420 characters
- 8.4 errors (420 char/sec * 2% error rate i.e., 98% accuracy rate)
- 5.0 of the errors are marked as suspicious (8.4 errors * 60% marked)
- 21.0 suspicious characters which are correct (8.4 errors * 2.5)
- 7.50 seconds of error correction time (5.0 errors * 1.5 sec per error)
- 10.5 seconds of checking suspicious characters (21.0 char * .5 sec/char)
Or in other words:
1 second of conventional OCR processing generates 18 seconds of editing time (7.50+10.5) and 3.4 errors (8.4-5.0) that get past manual error correction.
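The per-second figures above follow mechanically from the stated assumptions. The Python sketch below reproduces that arithmetic; the class, function, and parameter names are illustrative only and are not part of any Prime Recognition product. Values are left unrounded, so they differ slightly from the list above, which rounds at each step (for example, 5.04 marked errors rather than 5.0).

```python
from dataclasses import dataclass


@dataclass
class EditingLoad:
    """Clerical workload created by a batch of OCR output (illustrative names)."""
    errors: float              # OCR errors in the batch
    marked_errors: float       # errors flagged as suspicious
    false_suspects: float      # correct characters flagged as suspicious
    correction_seconds: float  # clerk time spent fixing flagged errors
    checking_seconds: float    # clerk time spent dismissing correct suspects

    @property
    def editing_seconds(self) -> float:
        return self.correction_seconds + self.checking_seconds

    @property
    def residual_errors(self) -> float:
        # Errors that were never flagged pass through to the final output.
        return self.errors - self.marked_errors


def editing_load(chars, accuracy=0.98, marked_share=0.60,
                 false_suspect_ratio=2.5, sec_per_error=1.5,
                 sec_per_false_suspect=0.5):
    """Apply the document's assumptions to a batch of `chars` characters."""
    errors = chars * (1.0 - accuracy)
    marked = errors * marked_share
    false_suspects = errors * false_suspect_ratio
    return EditingLoad(
        errors=errors,
        marked_errors=marked,
        false_suspects=false_suspects,
        correction_seconds=marked * sec_per_error,
        checking_seconds=false_suspects * sec_per_false_suspect,
    )


# One second of conventional OCR at 420 characters/sec.
conventional = editing_load(chars=420)
print(f"errors:          {conventional.errors:.2f}")
print(f"editing seconds: {conventional.editing_seconds:.2f}")
print(f"residual errors: {conventional.residual_errors:.2f}")
```

Because every assumption is a keyword argument, the same function can be rerun with a site's own accuracy rate or clerk timings to see how sensitive the editing time is to each one.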
Key Benefits
- Lowers OCR errors by 65%
- Lowers OCR “suspicious” characters marked for clerical labor review by 65%
- Data has 65% fewer OCR errors AFTER manual error correction.
Key Cost
- 3.3 times slower than conventional OCR
Prime Recognition High Accuracy OCR Calculations
Prime Recognition will take 3.3 seconds to produce 420 characters, and it will generate the following:
- 3.0 errors (65% fewer errors)
- 1.8 of the errors are marked as suspicious (60% of errors, the same ratio as conventional OCR but on a 65% smaller base)
- 7.4 suspicious characters which are correct (65% fewer suspicious characters)
- 2.70 seconds of error correction time (1.8 errors * 1.5 sec per error)
- 3.70 seconds of checking suspicious characters (7.4 * 0.5 sec per char)
Or in other words:
For the same throughput (420 characters) the Prime Recognition High Accuracy OCR engine generates 6.4 seconds of editing time (2.70+3.70) and 1.2 errors (3.0-1.8) that get past manual error correction.
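The High Accuracy figures can be derived by scaling the conventional baseline by the 65% reduction. The sketch below does that in Python; the constant and function names are hypothetical, and the results are unrounded, so they come out slightly below the rounded values quoted above (about 6.3 seconds of editing rather than 6.4).

```python
# Conventional baseline for 420 characters, from the assumptions in this document.
BASE_ERRORS = 420 * 0.02                 # 8.4 errors at 98% accuracy
BASE_FALSE_SUSPECTS = BASE_ERRORS * 2.5  # 21.0 correct characters flagged

ERROR_REDUCTION = 0.65        # base configuration: 65% fewer errors and suspects
SLOWDOWN = 3.3                # OCR time relative to conventional OCR (base configuration)
MARKED_SHARE = 0.60           # share of errors flagged as suspicious
SEC_PER_ERROR = 1.5           # clerk seconds to correct a flagged error
SEC_PER_FALSE_SUSPECT = 0.5   # clerk seconds to dismiss a correct suspect


def high_accuracy_load():
    """Workload for 420 characters after applying the 65% reduction."""
    keep = 1.0 - ERROR_REDUCTION
    errors = BASE_ERRORS * keep
    marked = errors * MARKED_SHARE
    false_suspects = BASE_FALSE_SUSPECTS * keep
    editing = marked * SEC_PER_ERROR + false_suspects * SEC_PER_FALSE_SUSPECT
    return {
        "ocr_seconds": 1.0 * SLOWDOWN,
        "errors": errors,
        "marked_errors": marked,
        "false_suspects": false_suspects,
        "editing_seconds": editing,
        "residual_errors": errors - marked,
    }


load = high_accuracy_load()
for name, value in load.items():
    print(f"{name:16s} {value:.2f}")
```

The slowdown factor is kept as a separate constant because the document only states it for the base configuration; it should not be reused for other reduction levels without a measured value.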
Summary
| | Conventional OCR | Prime Recognition OCR |
---|---|---|
OCR time | 1.0 second | 3.3 seconds |
Error Correction time | 18.0 seconds | 6.4 seconds |
Total Processing time | 19.0 seconds | 9.7 seconds |
Errors left in data after manual error correction | 3.4 errors | 1.2 errors |
Financial Calculations
Assumptions | Notes |
---|---|
Cost of PC is $5000 | This is simplistic because an OCR PC typically sits in a closet with no or inexpensive monitor/graphics adapter. An editing workstation, on the other hand, requires a large screen and sophisticated graphics adapter, plus chairs/desks/cubicles for the data entry clerk. |
Cost of data entry clerk is $20.00 per hour | Direct hourly rate is $9.75 per hour and the remainder is the overhead costs of labor, including fringe benefits, sick time, vacation time, personal time, direct supervisory salaries, human resource and accounting overhead, cost of real estate per person, etc. |
Cost of Prime Recognition software per PC is $14,940 | This is a “loaded” version. Versions of PrimeOCR exist that cost $5,000. |
Cost of Conventional OCR and Editing Workstations software per PC is $8,000 | |
Assume a system which requires the throughput of one conventional OCR package running on one PC.
Conventional OCR
Capital Costs
OCR PC | $5,000 | 1 station * $5,000 |
OCR S/W | $8,000 | 1 station * $8,000 |
Error Correction PCs | $90,000 | Ratio of error correction time to OCR time is 18 seconds to 1 second, therefore 18 stations * $5,000 |
Error Correction S/W | $144,000 | 18 stations * $8,000 |
Ongoing Costs (per year)
Data Entry Clerks | $691,200 | 18 clerks * $20/hour * 8 hours/day * 240 work days per year |
Prime Recognition High Accuracy OCR
Capital Costs
OCR PC | $16,500 | 3.3 stations * $5,000 |
OCR S/W | $49,302 | 3.3 stations * $14,940 |
Error Correction PCs | $32,000 | Ratio of error correction time to OCR time is 6.4 seconds to 3.3 seconds, therefore 3.3 OCR stations require 6.4 error correction stations * $5,000 |
Error Correction S/W | $51,200 | 6.4 stations * $8,000 |
Ongoing Costs (per year)
Data Entry Clerks | $245,800 | 6.4 clerks * $20/hour * 8 hours/day * 240 work days per year |
Summary
| | Conventional OCR | Prime Recognition OCR |
---|---|---|
Capital Costs | $247,000 | $149,002 |
Ongoing Costs | $691,200 | $245,800 |
Total Processing time | 19.0 seconds | 9.7 seconds |
Cost of errors left in data | Not Quantified | Not Quantified |
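The capital and ongoing costs in this section can be recomputed from the station counts and the unit prices in the financial assumptions. The following Python sketch is one way to do so; the names are illustrative, and because it multiplies the stated unit prices without rounding, its results may differ by small amounts from rounded figures in the tables.

```python
PC_COST = 5_000          # cost of one PC (OCR or editing)
EDIT_SW_COST = 8_000     # conventional OCR / editing workstation software per PC
PRIME_SW_COST = 14_940   # Prime Recognition software per PC ("loaded" version)
CLERK_RATE = 20.0        # loaded cost of a data entry clerk, dollars per hour
HOURS_PER_YEAR = 8 * 240 # 8 hours/day * 240 work days per year


def costs(ocr_stations, edit_stations, ocr_sw_cost):
    """Return (capital, yearly labor) for a given mix of stations."""
    capital = (ocr_stations * (PC_COST + ocr_sw_cost)
               + edit_stations * (PC_COST + EDIT_SW_COST))
    ongoing = edit_stations * CLERK_RATE * HOURS_PER_YEAR  # one clerk per editing station
    return capital, ongoing


# Station counts come from the processing-time ratios: 1 : 18 and 3.3 : 6.4.
conv_capital, conv_ongoing = costs(1, 18, EDIT_SW_COST)
prime_capital, prime_ongoing = costs(3.3, 6.4, PRIME_SW_COST)

labor_saving = 1 - prime_ongoing / conv_ongoing
capital_saving = 1 - prime_capital / conv_capital

print(f"conventional: capital ${conv_capital:,.0f}, labor ${conv_ongoing:,.0f}/year")
print(f"prime:        capital ${prime_capital:,.0f}, labor ${prime_ongoing:,.0f}/year")
print(f"labor saving {labor_saving:.1%}, capital saving {capital_saving:.1%}")
```

With these inputs the yearly labor saving comes to roughly 64%, which is consistent with the 65% figure quoted in the overview.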
Conclusions
1. Prime Recognition’s OCR engines create lower capital costs. You must look beyond the cost of the OCR engines and include the costs of manual error correction workstations.
2. Prime Recognition’s OCR engines create dramatically lower labor costs on an ongoing basis. Prime Recognition’s cost advantage will increase over time as the cost of PC power decreases by 25% per year and labor costs increase.
3. Prime Recognition generates 65% fewer errors that get past manual error correction (up to 82% with added options). This saves costs in applications that are sensitive to errors in data. For example, some applications use mainframe, database, or other application logic to reject data with errors. These rejects then must be manually checked to see if the errors were caused by OCR or some other source, and fixed.
4. The analysis above assumes that users want accurate data, and hence they use manual error correction to clean up data. This assumption applies to most applications. Some applications are purportedly not sensitive to errors in the data, e.g., full text searches with the new “fuzzy” search engines, so users are contemplating using OCR but without manual error correction. However, even fuzzy searches assume a significant level of accuracy in the data. If the error rate goes beyond that, perhaps on bad quality pages, those pages may not be found by electronic retrieval. Each user will have to decide how much risk they want to incur, e.g., is it OK if only 95% of the relevant documents show up in a search?
Prime Recognition offers a lower cost product (as low as $1,300) for applications that do not want to manually correct OCR errors. This engine is lower cost because it does not need to generate the information required by error correction software, such as character confidence levels, and suspicious character image “bounding boxes”.