How to Scan Documents to Searchable PDF Files in a Web Application


Last Updated on 2018-10-08

Scan Paper Documents into Searchable PDF Files With OCR in a Web Application

A typical step in document management is to scan paper documents into image-based PDFs and save them in your document repository. These PDFs, however, are neither searchable nor editable. They are essentially photos of the pages, which makes it inconvenient to search a scanned document or edit part of its content. Scanned image files also take up more storage space.

To save storage space and, more importantly, to improve work efficiency, you need an OCR engine to convert scanned image-based PDFs into text-based files.

In this article, we show how to use our Dynamic Web TWAIN SDK to scan paper documents in JavaScript and then save them as searchable and editable PDF files using the OCR Pro engine.

If you want to jump ahead and try scanning documents to searchable and editable PDF files right away, do so here:
Try an Online Demo

If you want to embed our module in your web application to scan documents to searchable and editable PDF files, get the 30-day free trial of Dynamic Web TWAIN here:
Download 30-day free trial

Scan Documents

Dynamic Web TWAIN provides simple APIs for controlling TWAIN scanners. You can use the following code to acquire images from a scanner. For better OCR results, we scan the documents in grayscale at a resolution of 300 dpi.
(Learn more: Recommended Scan Settings for the Best OCR Accuracy)

[javascript]function acquireImage() {
    // Select one of the available TWAIN scanners
    DWObject.SelectSourceByIndex(document.getElementById("source").selectedIndex);
    DWObject.OpenSource(); // open the source before changing its settings

    // Set scan settings such as pixel type, resolution, ADF, etc.
    DWObject.IfShowUI = false; // don't show the scanner's user interface
    DWObject.PixelType = 1; // scan in grayscale
    DWObject.Resolution = 300; // 300 dpi
    DWObject.IfFeederEnabled = true; // scan from the auto feeder
    DWObject.IfDuplexEnabled = false; // single-sided scanning
    DWObject.IfDisableSourceAfterAcquire = true; // close the source once the scan finishes

    // Acquire images from the scanner
    DWObject.AcquireImage();
}[/javascript]
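The scan settings used in the snippet above can also be kept in one object and applied in a single step, so the OCR-friendly defaults live in one place. The following is only a sketch: `OCR_SCAN_DEFAULTS` and `applyScanSettings` are hypothetical names that are not part of the SDK, and the function simply assigns the properties already used in this article to whatever object it is given.

```javascript
// Hypothetical helper, not part of Dynamic Web TWAIN.
// The property names are the ones used in the acquireImage() sample.
const OCR_SCAN_DEFAULTS = {
  IfShowUI: false,                  // hide the scanner's own dialog
  PixelType: 1,                     // grayscale
  Resolution: 300,                  // dpi
  IfFeederEnabled: true,            // scan from the auto feeder
  IfDuplexEnabled: false,           // single-sided
  IfDisableSourceAfterAcquire: true
};

// Copies the defaults, overlaid with any overrides, onto the target object
// and returns the merged settings that were applied.
function applyScanSettings(target, overrides) {
  const settings = Object.assign({}, OCR_SCAN_DEFAULTS, overrides);
  Object.keys(settings).forEach(function (key) {
    target[key] = settings[key];
  });
  return settings;
}
```

On a page where `DWObject` has been initialized, a call such as `applyScanSettings(DWObject, { IfDuplexEnabled: true })` would presumably go right before the images are acquired.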

Perform OCR and Save the Result as a Searchable PDF

Now we need to perform OCR on the images to extract text and save them as searchable and editable PDF files.

Dynamic Web TWAIN provides an OCR Professional add-on that extracts text from images and turns it into real text. The OCR engine supports both server-side and client-side OCR.

Include the OCR Pro JavaScript client in the head of your page.

<script type="text/javascript" language="javascript" src="Resources/addon/dynamsoft.webtwain.addon.ocrpro.js"></script>

Download the OCR Pro DLL to the client machine to perform OCR recognition on the client side.

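[javascript]// Resolve the directory of the current page, e.g. "/app/index.html" -> "/app/"
var CurrentPathName = unescape(location.pathname);
var CurrentPath = CurrentPathName.substring(0, CurrentPathName.lastIndexOf("/") + 1);

// OnSuccess and OnFailure are callbacks you define in your own page
DWObject.Addon.OCRPro.Download(CurrentPath + "Resources/addon/", OnSuccess, OnFailure);[/javascript]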
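The path handling in the snippet above can be checked on its own, without a browser or a scanner. The sketch below pulls it into two small functions. `getBasePath` and `getAddonPath` are hypothetical names introduced only for illustration, and `decodeURIComponent` is swapped in for the deprecated `unescape`. Note that `decodeURIComponent` throws on malformed percent sequences, which `unescape` does not.

```javascript
// Hypothetical helper: returns the directory part of a URL path,
// including the trailing slash, e.g. "/demo/index.html" -> "/demo/".
function getBasePath(pathname) {
  const decoded = decodeURIComponent(pathname);
  return decoded.substring(0, decoded.lastIndexOf("/") + 1);
}

// Hypothetical helper: appends the add-on folder used in this article.
function getAddonPath(pathname) {
  return getBasePath(pathname) + "Resources/addon/";
}
```

In the browser you would pass `location.pathname` to `getAddonPath` and hand the result to the download call.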

Below is the code snippet for doing client-side OCR with Dynamsoft’s OCR Professional Module.

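[javascript]function GetOCRProInfo(result) {
    var bRet = "";
    // Assumes the result object exposes GetPageCount()
    var pageCount = result.GetPageCount();
    for (var i = 0; i < pageCount; i++) {
        var page = result.GetPageContent(i);
        var letterCount = page.GetLettersCount();
        for (var n = 0; n < letterCount; n++) {
            var letter = page.GetLetterContent(n);
            bRet += letter.GetText();
        }
    }
    console.log(bRet); // Get OCR result.

    var fileName = "ocrresult.pdf"; // save the OCR result as PDF
    var savePath = "D:\\temp\\" + fileName; // backslashes must be escaped in a JavaScript string
    if (savePath.length > 1)
        result.Save(savePath); // assumes the result object exposes Save(path)
}

// GetErrorInfo is the failure callback, defined elsewhere in your page
DWObject.Addon.OCRPro.Recognize(0, GetOCRProInfo, GetErrorInfo); // 0 is the index of the image[/javascript]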
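To see how the page and letter traversal in the callback behaves without a scanner attached, it can be run against a stand-in result object. The sketch below is an illustration under assumptions: `extractPageTexts` and `makeMockResult` are hypothetical names, the mock only imitates the accessor methods used in the snippet above, and `GetPageCount()` is assumed to be how the page count is obtained.

```javascript
// Hypothetical helper: returns one string per page of an OCR result.
// Assumes the result exposes GetPageCount() and GetPageContent(i), and that
// each page exposes GetLettersCount() and GetLetterContent(n).GetText().
function extractPageTexts(result) {
  const pages = [];
  const pageCount = result.GetPageCount();
  for (let i = 0; i < pageCount; i++) {
    const page = result.GetPageContent(i);
    const letterCount = page.GetLettersCount();
    let text = "";
    for (let n = 0; n < letterCount; n++) {
      text += page.GetLetterContent(n).GetText();
    }
    pages.push(text);
  }
  return pages;
}

// Stand-in for a real OCR result, built from plain strings (one per page).
function makeMockResult(pageStrings) {
  return {
    GetPageCount: function () { return pageStrings.length; },
    GetPageContent: function (i) {
      const letters = pageStrings[i].split("");
      return {
        GetLettersCount: function () { return letters.length; },
        GetLetterContent: function (n) {
          return { GetText: function () { return letters[n]; } };
        }
      };
    }
  };
}
```

Keeping the text per page rather than in one long string makes it easier to tell which scanned page a search hit came from.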

As always, send us any questions you have directly or in the comments section below.
