What is OCR and Ideal Settings to Maximize Extraction for Meaningful Data
Many technologies can help organizations go paperless. Among the top considerations is optical character recognition (OCR) technology. OCR is a powerful tool that is often part of a document management system (DMS). But how can organizations ensure they get the most out of it?
What’s All the Fuss About OCR?
In a nutshell, OCR takes files containing text that would otherwise be good only for reading by eye and converts them into meaningful digital data. Put another way, it takes the text in a physical document, image or photo and converts it into text that a word processor can use. This means you can search a document and manipulate its content. Imagine you have a 300-page book and need to find every place the word “document” appears. You’d have to read the entire book, probably twice, and list each occurrence. With OCR, you can instead scan the book and convert it into searchable text, and the search then takes seconds.
Turn that scenario into a customer service moment and you start to see why OCR is used in business. Let’s say you’re an insurance provider and need to find all patients who had knee surgery in the past year. This used to mean a lot of time going through paper files. But if those files are scanned and run through OCR, you can create that list in just a few seconds. The result is an obvious improvement in productivity and a boon to how you can use data.
There are also cost-saving advantages. OCR is relatively inexpensive, and it’s often bundled with scanner software, so in many cases you might already have it. By using OCR, businesses can reduce the copying, filing, printing, shipping and other tasks that come with paperwork.
It also improves security. Paper is easier to misplace or steal than files on a server protected by passwords and encryption. Paper can also be permanently destroyed by fire, water damage, etc. Many businesses don’t make redundant backup copies of their paper-based files, but electronic files can be backed up to remote sites.
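
To make the book-search scenario concrete, here is a minimal sketch of what becomes possible once OCR has turned pages into text. It assumes the OCR step has already produced one string per page; the `pages` list and the `find_word` helper are hypothetical names used only for illustration, not part of any particular OCR product.

```python
import re

def find_word(pages, word):
    """Return the 1-based page numbers whose text contains `word` as a whole word."""
    # Case-insensitive, whole-word match; re.escape guards against special characters.
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return [number for number, text in enumerate(pages, start=1)
            if pattern.search(text)]

# Hypothetical OCR output: one string per scanned page.
pages = [
    "This document describes the claim.",
    "No relevant terms appear here.",
    "Attach every Document to the file.",
]

print(find_word(pages, "document"))  # -> [1, 3]
```

The same lookup over 300 pages of recognized text is still a single pass, which is why the search takes seconds instead of a full read-through.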
There are many other benefits to OCR. To point out a few just for finance and accounting, Infinit Accounting illustrates how it can be used to capture employee receipts to reduce fraud. They also point out how OCR can be used in Accounts Payable to improve workflow. To realize the full benefit of these and other uses of OCR, a few essential settings must be right.
Image Settings for Best OCR Extraction Results
You can forget about getting meaningful data out of OCR if you can’t even extract the text correctly. So, first you must make sure your basic OCR settings are spot on.
First up, resolution. A typical font size is at least 10 points, and for text of that size 300 DPI is usually enough. For smaller font sizes you might need a higher resolution, up to 600 DPI.
Next, color mode. Whether it is an online OCR service or an OCR software module, an OCR tool will typically let you choose among three color modes when scanning: black and white, grayscale and color. Generally, grayscale is the best option because it keeps significantly more detail than black and white. Black and white can also work for most text documents if the font is highly legible and large enough, but beware of using it when image quality is not very high. If your document contains colored images and you need to preserve the colors, then obviously your go-to choice will be to scan in color mode.
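
The link between font size and resolution can be worked out with simple arithmetic, since one typographic point is 1/72 of an inch. The sketch below is only an illustration: the helper names are hypothetical, and the idea of scaling DPI so that smaller text keeps the same pixel height as 10-point text at 300 DPI is an assumption, not a rule from any OCR vendor. Note too that the point size is the nominal font height, so individual letters are somewhat smaller than the figures shown.

```python
POINTS_PER_INCH = 72  # one typographic point is 1/72 inch

def text_height_px(font_size_pt, dpi):
    """Nominal height in pixels of text of the given point size at a scan DPI."""
    return font_size_pt / POINTS_PER_INCH * dpi

def suggest_dpi(font_size_pt, reference_pt=10, reference_dpi=300, max_dpi=600):
    """Scale DPI so text keeps the pixel height of 10 pt text at 300 DPI.

    Assumption for illustration: never go below the reference DPI and never
    above max_dpi, mirroring the 300 to 600 DPI range discussed above.
    """
    scaled = reference_dpi * reference_pt / font_size_pt
    return min(max(scaled, reference_dpi), max_dpi)

print(round(text_height_px(10, 300), 1))  # 10 pt at 300 DPI -> 41.7 px
print(text_height_px(6, 300))             # 6 pt at 300 DPI  -> 25.0 px
print(suggest_dpi(8))                     # -> 375.0
print(suggest_dpi(5))                     # capped -> 600
```

In practice a scanner offers only a few preset resolutions, so you would round a suggested value up to the next available setting.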
Another consideration is image compression, of which there are two types: lossy and lossless. Lossless compression is the option to go with for better OCR recognition. It reduces file size without discarding any image data, so the image that is read back is identical to the original scan. You also must consider the file type. You can choose to save scanned images as uncompressed TIFF or as PNG, which is losslessly compressed. Doing so allows for better processing options in the future, because lossy file formats, such as JPEG, discard detail and lose more quality each time the image is re-saved.
Finally, brightness settings will have an impact. By adjusting the brightness, you will get lighter or darker images. A medium brightness value of 50 percent is best in most cases. These settings are useful for OCR on PDF or other document types.
OCR technology has been around for a while now. It’s a mature technology with good accuracy, and it has demonstrable productivity and other benefits for business. That’s why businesses rely on it more and more. But users must follow best practices for the basic OCR settings to make sure it does what they want it to do.
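
To illustrate the lossless-versus-lossy distinction from the compression discussion, here is a small sketch using only Python’s standard library. It compresses simulated grayscale pixel data with `zlib` (an implementation of DEFLATE, the algorithm PNG uses) and checks that the data comes back unchanged. The pixel values are made up, and the `quantize` function is only a crude stand-in for lossy compression, not how JPEG actually works.

```python
import zlib

# Hypothetical 8-bit grayscale scan data: white paper with one dark stroke
# and soft mid-gray edges, repeated over 50 rows.
row = bytes([255] * 40 + [200, 120, 30, 0, 0, 30, 120, 200] + [255] * 40)
pixels = row * 50

# Lossless: the data gets smaller but decompresses to exactly the same bytes.
compressed = zlib.compress(pixels)
restored = zlib.decompress(compressed)

def quantize(data, step=64):
    """Crude stand-in for lossy compression: round each pixel down to a coarse level."""
    return bytes((value // step) * step for value in data)

lossy = quantize(pixels)

print(len(compressed) < len(pixels))  # True: the file shrinks
print(restored == pixels)             # True: nothing was lost
print(lossy == pixels)                # False: detail was discarded for good
```

The soft gray edge pixels are exactly the kind of detail an OCR engine uses to tell similar characters apart, which is why a lossless format is the safer choice for scans.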