How to OCR PDF in a .NET Desktop Application

Last Updated on 2018-10-10

Convert a PDF file to an image and read text from the image in C# / VB.NET

Recently, several customers have asked us if it’s possible to use Dynamic .NET TWAIN to convert a PDF file to an image, and then extract text from it, all in a .NET desktop app. In this article I’ll provide some samples that show how to do this using Dynamic .NET TWAIN with our PDF Rasterizer and OCR add-ons. The solution works in both WinForms and WPF applications.

All the samples provided below (both C# and VB.NET) are included in the 30-day free trial installer of Dynamic .NET TWAIN.

Read a PDF file and convert it to an image

PDF Rasterizer

With the PDF Rasterizer add-on of Dynamic .NET TWAIN, you can load PDF files from your local disk as images, then display them in the Dynamic .NET TWAIN viewer. You can also specify a resolution for the image.

C# code snippet
[csharp]
private void btnLoadPDF_Click(object sender, EventArgs e)
{
    try
    {
        // Let the user pick one or more PDF files.
        OpenFileDialog openfiledlg = new OpenFileDialog();
        openfiledlg.Filter = "PDF|*.PDF";
        openfiledlg.FilterIndex = 0;
        openfiledlg.Multiselect = true;

        if (openfiledlg.ShowDialog() == DialogResult.OK)
        {
            foreach (string strfilename in openfiledlg.FileNames)
            {
                // Rasterize each PDF at the resolution selected in the combo box.
                this.dynamicDotNetTwain1.ConvertPDFToImage(strfilename, float.Parse(cmbPDFResolution.SelectedItem.ToString()));
            }
        }
    }
    catch (Exception exp)
    {
        MessageBox.Show(exp.Message);
    }
}
[/csharp]
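
The snippet above reads the resolution with float.Parse, which throws if nothing is selected in the combo box or the entry is not a number. Here is a minimal sketch of a more forgiving helper. The class name PdfResolution and the 200 DPI fallback are assumptions made for illustration. They are not part of Dynamic .NET TWAIN and the fallback is not a documented default.

```csharp
using System.Globalization;

public static class PdfResolution
{
    // Hypothetical fallback value chosen for this sketch only.
    public const float FallbackDpi = 200f;

    // Returns the parsed resolution, or FallbackDpi when the input is
    // null, not a number, not positive, or not finite.
    public static float ParseOrDefault(object selectedItem)
    {
        string text = selectedItem == null ? null : selectedItem.ToString();
        float dpi;
        bool ok = float.TryParse(text, NumberStyles.Float, CultureInfo.InvariantCulture, out dpi);
        if (!ok || float.IsNaN(dpi) || float.IsInfinity(dpi) || dpi <= 0f)
        {
            return FallbackDpi;
        }
        return dpi;
    }
}
```

With a helper like this, the call in the click handler could pass PdfResolution.ParseOrDefault(cmbPDFResolution.SelectedItem) as the second argument instead of parsing inline.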

Perform OCR

With the OCR add-on of Dynamic .NET TWAIN, you can extract text from images and save the OCR result as a text file or a PDF file (text only, or image over text) in your .NET application. The .NET OCR SDK supports 40+ languages, including Arabic and Chinese.

C# code snippet
[csharp]
private void OCR(bool isOcrOnRectangleArea)
{
    string languageFolder = m_strCurrentDirectory;
    if (m_bSamplesExist)
    {
        languageFolder = m_strCurrentDirectory + @"Samples\Bin\";
    }
    // Specify the tessdata folder that contains the language package.
    this.dynamicDotNetTwain1.OCRTessDataPath = languageFolder;
    // Specify the language for OCR.
    this.dynamicDotNetTwain1.OCRLanguage = languages[this.cbxOCRLanguage.Text];
    // Specify the format of the OCR result: text file, text PDF, or image-over-text PDF.
    this.dynamicDotNetTwain1.OCRResultFormat = (Dynamsoft.DotNet.TWAIN.OCR.ResultFormat)this.ddlResultFormat.SelectedIndex;

    // Specify the folder that contains the OCR DLL.
    string strDllPath = m_strCurrentDirectory;
    if (m_bSamplesExist)
    {
        strDllPath = m_strCurrentDirectory + @"Redistributable\OCRResources\";
    }

    this.dynamicDotNetTwain1.OCRDllPath = strDllPath;

    // Nothing to recognize if no image is loaded in the buffer.
    if (this.dynamicDotNetTwain1.CurrentImageIndexInBuffer < 0)
    {
        MessageBox.Show("Please load an image before doing OCR!", "Index out of bounds", MessageBoxButtons.OK, MessageBoxIcon.Warning);
        return;
    }

    // Run OCR on the selected images, or on a rectangle of the current image.
    byte[] sbytes = null;
    if (!isOcrOnRectangleArea)
    {
        sbytes = this.dynamicDotNetTwain1.OCR(this.dynamicDotNetTwain1.CurrentSelectedImageIndicesInBuffer);
    }
    else
    {
        sbytes = this.dynamicDotNetTwain1.OCR(dynamicDotNetTwain1.CurrentImageIndexInBuffer, int.Parse(tbxLeft.Text),
            int.Parse(tbxTop.Text), int.Parse(tbxRight.Text), int.Parse(tbxButtom.Text));
    }

    if (sbytes != null && sbytes.Length > 0)
    {
        // Save the result; index 0 is the text format, any other index is a PDF format.
        SaveFileDialog filedlg = new SaveFileDialog();
        if (this.ddlResultFormat.SelectedIndex != 0)
        {
            filedlg.Filter = "PDF File(*.pdf)| *.pdf";
        }
        else
        {
            filedlg.Filter = "Text File(*.txt)| *.txt";
        }
        if (filedlg.ShowDialog() == DialogResult.OK)
        {
            File.WriteAllBytes(filedlg.FileName, sbytes);
        }
    }
    else
    {
        // OCR returned nothing; report the error if the component recorded one.
        if (this.dynamicDotNetTwain1.ErrorCode != 0)
        {
            MessageBox.Show(this.dynamicDotNetTwain1.ErrorString);
        }
    }
}
[/csharp]
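
When OCR runs on a rectangle, the sample parses the four text boxes with int.Parse, so a blank or non-numeric entry raises a FormatException. A rough sketch of validating the coordinates first is shown below. The OcrRegion class is hypothetical and is not part of the SDK. The rules it enforces (non-negative origin, right greater than left, bottom greater than top) are assumptions about what makes a sensible region, not documented requirements of the OCR method.

```csharp
public static class OcrRegion
{
    // Returns true only when all four inputs are integers that describe
    // a non-empty rectangle whose left and top are not negative.
    public static bool TryParse(string leftText, string topText, string rightText, string bottomText,
        out int left, out int top, out int right, out int bottom)
    {
        // Assign every out parameter up front so early failure is safe.
        left = top = right = bottom = 0;

        bool parsed = int.TryParse(leftText, out left)
            && int.TryParse(topText, out top)
            && int.TryParse(rightText, out right)
            && int.TryParse(bottomText, out bottom);

        return parsed && left >= 0 && top >= 0 && right > left && bottom > top;
    }
}
```

In the OCR method, the text box values (tbxLeft.Text and so on) could be passed to OcrRegion.TryParse, and a warning shown with MessageBox.Show when it returns false, before the OCR call is made.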

Get samples

You can download and try the .NET PDF samples above from the Dynamsoft sample gallery.
