How to OCR PDF in a .NET Desktop Application
Convert a PDF file to an image and read text from the image in C# / VB.NET

Recently, several customers have asked us whether it’s possible to use Dynamic .NET TWAIN to convert a PDF file to an image and then extract text from it, all in a .NET desktop app. In this article I’ll provide samples that show how to do this using Dynamic .NET TWAIN with our PDF Rasterizer and OCR add-ons. The solution works in both WinForms and WPF applications.

All the samples provided below (both C# and VB.NET) are included in the 30-day free trial installer of Dynamic .NET TWAIN.
Read a PDF file and convert it to image
With the PDF Rasterizer add-on of Dynamic .NET TWAIN, you can load PDF files from your local disk as images, then display them in the Dynamic .NET TWAIN viewer. You can also specify a resolution for the image.

C# code snippet
private void btnLoadPDF_Click(object sender, EventArgs e)
{
    try
    {
        // Let the user pick one or more PDF files.
        OpenFileDialog openfiledlg = new OpenFileDialog();
        openfiledlg.Filter = "PDF|*.PDF";
        openfiledlg.FilterIndex = 1; // filter indices are 1-based
        openfiledlg.Multiselect = true;

        if (openfiledlg.ShowDialog() == DialogResult.OK)
        {
            // Convert each selected PDF to images at the resolution
            // chosen in the combo box.
            foreach (string strfilename in openfiledlg.FileNames)
            {
                this.dynamicDotNetTwain1.ConvertPDFToImage(
                    strfilename,
                    float.Parse(cmbPDFResolution.SelectedItem.ToString()));
            }
        }
    }
    catch (Exception exp)
    {
        MessageBox.Show(exp.Message);
    }
}
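The handler above relies on float.Parse, which throws if nothing is selected in the combo box or if the text is not a number, so the user only sees a generic exception message. As a minimal sketch of one way to check the value first: the helper below is hypothetical (ResolutionParser and TryGetResolution are not part of Dynamic .NET TWAIN), and it assumes the combo box items are plain numbers such as "200" written with a period as the decimal separator.

```csharp
using System;
using System.Globalization;

// Hypothetical helper for illustration; not part of Dynamic .NET TWAIN.
public static class ResolutionParser
{
    // Returns true and sets dpi when selectedItem holds a positive, finite number.
    // Otherwise returns false and leaves dpi set to fallbackDpi.
    public static bool TryGetResolution(object selectedItem, float fallbackDpi, out float dpi)
    {
        dpi = fallbackDpi;
        if (selectedItem == null)
        {
            return false;
        }

        float parsed;
        // Assumption: items use invariant-culture formatting ("150", "200.5").
        bool ok = float.TryParse(
            selectedItem.ToString(),
            NumberStyles.Float,
            CultureInfo.InvariantCulture,
            out parsed);

        if (!ok || float.IsNaN(parsed) || float.IsInfinity(parsed) || parsed <= 0)
        {
            return false;
        }

        dpi = parsed;
        return true;
    }
}
```

In the click handler, this could be called once before the foreach loop, for example with cmbPDFResolution.SelectedItem and a fallback of your choosing, and the resulting value passed to ConvertPDFToImage. Whether to fall back silently or to warn the user when the check fails is a design choice left to the application.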
Perform OCR
With the OCR add-on of Dynamic .NET TWAIN, you can extract text from images and save the OCR result as a text file or as a PDF file (text only, or image over text) in your .NET application. The .NET OCR SDK supports 40+ languages, including Arabic, Chinese and more.

C# code snippet
private void OCR(bool isOcrOnRectangleArea)
{
    // Specify the tessdata folder that holds the language package.
    string languageFolder = m_strCurrentDirectory;
    if (m_bSamplesExist)
    {
        languageFolder = m_strCurrentDirectory + @"Samples\Bin\";
    }
    this.dynamicDotNetTwain1.OCRTessDataPath = languageFolder;

    // Specify the language for OCR.
    this.dynamicDotNetTwain1.OCRLanguage = languages[this.cbxOCRLanguage.Text];

    // Specify the format of the OCR result: text file, text PDF,
    // or image-over-text PDF.
    this.dynamicDotNetTwain1.OCRResultFormat =
        (Dynamsoft.DotNet.TWAIN.OCR.ResultFormat)this.ddlResultFormat.SelectedIndex;

    // Specify the folder that holds the OCR DLL resources.
    string strDllPath = m_strCurrentDirectory;
    if (m_bSamplesExist)
    {
        strDllPath = m_strCurrentDirectory + @"Redistributable\OCRResources\";
    }
    this.dynamicDotNetTwain1.OCRDllPath = strDllPath;

    // Nothing to recognize if no image has been loaded.
    if (this.dynamicDotNetTwain1.CurrentImageIndexInBuffer < 0)
    {
        MessageBox.Show("Please load an image before doing OCR!", "Index out of bounds",
            MessageBoxButtons.OK, MessageBoxIcon.Warning);
        return;
    }

    byte[] sbytes = null;
    if (!isOcrOnRectangleArea)
    {
        // OCR the currently selected images.
        sbytes = this.dynamicDotNetTwain1.OCR(
            this.dynamicDotNetTwain1.CurrentSelectedImageIndicesInBuffer);
    }
    else
    {
        // OCR only the rectangle entered in the text boxes, on the current image.
        sbytes = this.dynamicDotNetTwain1.OCR(
            dynamicDotNetTwain1.CurrentImageIndexInBuffer,
            int.Parse(tbxLeft.Text), int.Parse(tbxTop.Text),
            int.Parse(tbxRight.Text), int.Parse(tbxButtom.Text));
    }

    if (sbytes != null && sbytes.Length > 0)
    {
        // In this sample, index 0 of the format list is plain text;
        // the other entries are PDF formats.
        SaveFileDialog filedlg = new SaveFileDialog();
        if (this.ddlResultFormat.SelectedIndex != 0)
        {
            filedlg.Filter = "PDF File(*.pdf)|*.pdf";
        }
        else
        {
            filedlg.Filter = "Text File(*.txt)|*.txt";
        }

        if (filedlg.ShowDialog() == DialogResult.OK)
        {
            File.WriteAllBytes(filedlg.FileName, sbytes);
        }
    }
    else
    {
        // OCR returned nothing: show the error reported by the component, if any.
        if (this.dynamicDotNetTwain1.ErrorCode != 0)
            MessageBox.Show(this.dynamicDotNetTwain1.ErrorString);
    }
}
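When OCR runs on a rectangle, the sample passes the four text box values straight to int.Parse, so an empty or non-numeric field throws before OCR is called. The sketch below shows one way the values could be checked first. OcrRegionParser and TryParseRegion are hypothetical names, not part of Dynamic .NET TWAIN, and the sketch assumes the four values are pixel coordinates of the left, top, right and bottom edges, as the text box names in the sample suggest.

```csharp
using System;

// Hypothetical helper for illustration; not part of Dynamic .NET TWAIN.
public static class OcrRegionParser
{
    // Parses the four edge values and checks that they describe a
    // non-empty rectangle with non-negative origin.
    // On failure, returns false and sets error to a message for the user.
    public static bool TryParseRegion(
        string leftText, string topText, string rightText, string bottomText,
        out int left, out int top, out int right, out int bottom, out string error)
    {
        left = top = right = bottom = 0;
        error = null;

        if (!int.TryParse(leftText, out left) ||
            !int.TryParse(topText, out top) ||
            !int.TryParse(rightText, out right) ||
            !int.TryParse(bottomText, out bottom))
        {
            error = "Left, top, right and bottom must all be whole numbers.";
            return false;
        }

        if (left < 0 || top < 0)
        {
            error = "Left and top must not be negative.";
            return false;
        }

        if (right <= left || bottom <= top)
        {
            error = "Right must be greater than left, and bottom greater than top.";
            return false;
        }

        return true;
    }
}
```

In the OCR method, the rectangle branch could call TryParseRegion with the four text box values, show the error in a MessageBox and return when it fails, and otherwise pass the parsed integers to the OCR call. This sketch does not check the rectangle against the image dimensions, since how the component handles out-of-range coordinates is not covered here.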
Get samples
You can download and try the .NET PDF samples above from the Dynamsoft sample gallery.