How to Implement Document Detection in Python Using Dynamsoft Capture Vision SDK
Document detection is a key feature in many modern desktop, mobile and web applications. Recently, Dynamsoft unleashed its Python Capture Vision SDK, which provides a groundbreaking solution for developers working across Windows, Linux, and macOS. While the SDK also supports barcode and MRZ detection, this tutorial will focus on its robust document detection capabilities. We will walk you through how to integrate this feature into your Python projects, ensuring seamless and efficient document detection across platforms.
This article is Part 1 in a 3-Part Series.
Python Document Detection Demo on macOS
Prerequisites
-
Dynamsoft Capture Vision Trial License: Obtain a 30-Day trial license key for the Dynamsoft Capture Vision SDK.
- Tesseract OCR: Follow the official documentation to download and install Tesseract OCR on your machine.
-
Tessdata: Download the language models for the desired languages.
-
Python Packages: Install the required Python packages using the following commands:
pip install dynamsoft-capture-vision-bundle opencv-python pytesseract
What are these packages for?
dynamsoft-capture-vision-bundle
is the Dynamsoft Capture Vision SDK for Python.opencv-python
captures camera frames and displays processed image results.pytesseract
is a Python wrapper for Tesseract OCR. It invokes the pre-installed Tesseract OCR engine to recognize text orientation in images.
Getting Started with the Dynamsoft Python Capture Vision Example
To quickly learn the basic API usage of the Dynamsoft Python Capture Vision SDK, you can refer to the official example on GitHub. This example demonstrates how to detect document edges and perform perspective correction from a static image file.
from dynamsoft_capture_vision_bundle import *
import os
import sys
if __name__ == '__main__':
errorCode, errorMsg = LicenseManager.init_license("LICENSE-KEY")
if errorCode != EnumErrorCode.EC_OK and errorCode != EnumErrorCode.EC_LICENSE_CACHE_USED:
print("License initialization failed: ErrorCode:", errorCode, ", ErrorString:", errorMsg)
else:
cvr = CaptureVisionRouter()
while (True):
image_path = input(
">> Input your image full path:\n"
">> 'Enter' for sample image or 'Q'/'q' to quit\n"
).strip('\'"')
if image_path.lower() == "q":
sys.exit(0)
if image_path == "":
image_path = "../Images/document-sample.jpg"
if not os.path.exists(image_path):
print("The image path does not exist.")
continue
result = cvr.capture(image_path, EnumPresetTemplate.PT_DETECT_AND_NORMALIZE_DOCUMENT.value)
if result.get_error_code() != EnumErrorCode.EC_OK:
print("Error:", result.get_error_code(), result.get_error_string())
normalized_images_result = result.get_normalized_images_result()
if normalized_images_result is None or len(normalized_images_result.get_items()) == 0:
print("No normalized documents.")
else:
items = normalized_images_result.get_items()
print("Normalized", len(items), "documents.")
for index,item in enumerate(normalized_images_result.get_items()):
out_path = "normalizedResult_" + str(index) + ".png"
image_manager = ImageManager()
image = item.get_image_data()
if image != None:
errorCode, errorMsg = image_manager.save_to_file(image, out_path)
if errorCode == 0:
print("Document " + str(index) + " file: " + out_path)
input("Press Enter to quit...")
Explanation
- The
LicenseManager.init_license
method initializes the Dynamsoft Capture Vision SDK with a valid license key. - The
CaptureVisionRouter
class manages image processing tasks and coordinates image processing modules. Itscapture
method processes the input image and returns the result.
To better understand the example’s functionality, we can use OpenCV to visualize the detected document edges and the perspective-corrected document.
Converting Image Data to OpenCV Mat
In the above code, the image
object is an instance of the ImageData class. To convert it to an OpenCV Mat format, you can use the following code snippet:
def convertImageData2Mat(normalized_image):
ba = bytearray(normalized_image.get_bytes())
width = normalized_image.get_width()
height = normalized_image.get_height()
channels = 3
if normalized_image.get_image_pixel_format() == EnumImagePixelFormat.IPF_BINARY:
channels = 1
all = []
skip = normalized_image.stride * 8 - width
index = 0
n = 1
for byte in ba:
byteCount = 7
while byteCount >= 0:
b = (byte & (1 << byteCount)) >> byteCount
if index < normalized_image.stride * 8 * n - skip:
if b == 1:
all.append(255)
else:
all.append(0)
byteCount -= 1
index += 1
if index == normalized_image.stride * 8 * n:
n += 1
mat = np.array(all, dtype=np.uint8).reshape(height, width, channels)
return mat
elif normalized_image.get_image_pixel_format() == EnumImagePixelFormat.IPF_GRAYSCALED:
channels = 1
mat = np.array(ba, dtype=np.uint8).reshape(height, width, channels)
return mat
Obtaining the Corners of the Detected Document
Each item
is an instance of the NormalizedImageResultItem class. It contains the corners of the detected document. Use the following code snippet to extract the coordinates of the four corners:
location = item.get_location()
x1 = location.points[0].x
y1 = location.points[0].y
x2 = location.points[1].x
y2 = location.points[1].y
x3 = location.points[2].x
y3 = location.points[2].y
x4 = location.points[3].x
y4 = location.points[3].y
del location
Ensure that you use del location
to release the memory allocated for the location
object, preventing memory leaks by properly releasing the memory allocated for the C++ structure.
Drawing the Contours of the Detected Document
cv2.drawContours(cv_image, [np.intp([(x1, y1), (x2, y2), (x3, y3), (x4, y4)])], 0, (0, 255, 0), 2)
Drawing the Perspective-Corrected Document
cv2.imshow("Normalized Image", mat)
In common scanning scenarios, the document’s angle and orientation may deviate slightly, but this won’t affect the document’s direction after the perspective transformation. Here, we deliberately provide a document with a large rotation angle. After the perspective transformation, the document will rotate 180°. To solve this issue, we can use Tesseract’s text orientation detection to get the rotation angle and then adjust the document accordingly. In the next section, we will share the solution.
Correcting Document Orientation with Tesseract OCR
After installing pytesseract and the Tesseract OCR engine, you can use the following code snippet to detect the text orientation in the document:
osd_data = pytesseract.image_to_osd(
mat, output_type=Output.DICT)
print(osd_data)
rotation_angle = osd_data['rotate']
print(
f"Detected Character Orientation: {rotation_angle} degrees")
To improve the accuracy of text orientation detection, you need to download specific language models for Tesseract OCR. For example, to detect English text, download the eng model from the GitHub repository. Then copy the eng.traineddata
file to the tessdata
directory. After that, update the detection parameters with the language model:
osd_data = pytesseract.image_to_osd(
mat, lang='eng', output_type=Output.DICT)
Based on the detected rotation angle, you can adjust the document’s orientation:
if rotation_angle == 90:
mat = cv2.rotate(
mat, cv2.ROTATE_90_CLOCKWISE)
elif rotation_angle == 180:
mat = cv2.rotate(mat, cv2.ROTATE_180)
elif rotation_angle == 270:
mat = cv2.rotate(
mat, cv2.ROTATE_90_COUNTERCLOCKWISE)
cv2.imshow("Rotated Image", mat)
Real-time Document Detection via Webcam
Besides detecting documents from static images, scanning documents in real time through a camera’s video stream is also a common scenario. You can use OpenCV to capture video frames and process them with the Dynamsoft Capture Vision SDK. The full code is as follows:
from dynamsoft_capture_vision_bundle import *
import cv2
import numpy as np
import queue
from utils import *
class FrameFetcher(ImageSourceAdapter):
def has_next_image_to_fetch(self) -> bool:
return True
def add_frame(self, imageData):
self.add_image_to_buffer(imageData)
class MyCapturedResultReceiver(CapturedResultReceiver):
def __init__(self, result_queue):
super().__init__()
self.result_queue = result_queue
def on_captured_result_received(self, captured_result):
self.result_queue.put(captured_result)
if __name__ == '__main__':
errorCode, errorMsg = LicenseManager.init_license(
"LICENSE-KEY")
if errorCode != EnumErrorCode.EC_OK and errorCode != EnumErrorCode.EC_LICENSE_CACHE_USED:
print("License initialization failed: ErrorCode:",
errorCode, ", ErrorString:", errorMsg)
else:
vc = cv2.VideoCapture(0)
if not vc.isOpened():
print("Error: Camera is not opened!")
exit(1)
cvr = CaptureVisionRouter()
fetcher = FrameFetcher()
cvr.set_input(fetcher)
# Create a thread-safe queue to store captured items
result_queue = queue.Queue()
receiver = MyCapturedResultReceiver(result_queue)
cvr.add_result_receiver(receiver)
errorCode, errorMsg = cvr.start_capturing("Default")
if errorCode != EnumErrorCode.EC_OK:
print("error:", errorMsg)
while True:
ret, frame = vc.read()
if not ret:
print("Error: Cannot read frame!")
break
fetcher.add_frame(convertMat2ImageData(frame))
# Check if there are any new captured items from the queue
if not result_queue.empty():
captured_result = result_queue.get_nowait()
items = captured_result.get_items()
for item in items:
if item.get_type() == EnumCapturedResultItemType.CRIT_BARCODE:
text = item.get_text()
location = item.get_location()
x1 = location.points[0].x
y1 = location.points[0].y
x2 = location.points[1].x
y2 = location.points[1].y
x3 = location.points[2].x
y3 = location.points[2].y
x4 = location.points[3].x
y4 = location.points[3].y
cv2.drawContours(
frame, [np.intp([(x1, y1), (x2, y2), (x3, y3), (x4, y4)])], 0, (0, 255, 0), 2)
cv2.putText(frame, text, (x1, y1),
cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)
del location
elif item.get_type() == EnumCapturedResultItemType.CRIT_NORMALIZED_IMAGE:
location = item.get_location()
x1 = location.points[0].x
y1 = location.points[0].y
x2 = location.points[1].x
y2 = location.points[1].y
x3 = location.points[2].x
y3 = location.points[2].y
x4 = location.points[3].x
y4 = location.points[3].y
cv2.drawContours(
frame, [np.intp([(x1, y1), (x2, y2), (x3, y3), (x4, y4)])], 0, (255, 0, 0), 2)
cv2.putText(frame, "Edge Detection", (x1, y1),
cv2.FONT_HERSHEY_SIMPLEX, 1, (255, 0, 0), 2)
del location
if cv2.waitKey(1) & 0xFF == ord('q'):
break
cv2.imshow('frame', frame)
cvr.stop_capturing()
vc.release()
cv2.destroyAllWindows()
Explanation
-
The
FrameFetcher
class implements theImageSourceAdapter
interface to add frame data to the built-in buffer. TheMat
object needs to be converted toImageData
before being added to the buffer.def convertMat2ImageData(mat): if len(mat.shape) == 3: height, width, channels = mat.shape pixel_format = EnumImagePixelFormat.IPF_RGB_888 else: height, width = mat.shape channels = 1 pixel_format = EnumImagePixelFormat.IPF_GRAYSCALED stride = width * channels imagedata = ImageData(mat.tobytes(), width, height, stride, pixel_format) return imagedata
-
The
MyCapturedResultReceiver
class implements theCapturedResultReceiver
interface. Theon_captured_result_received
method, running on a native C++ worker thread, returns the processed result to the main thread and stores it in a thread-safe queue. In the main thread, we can check the queue for new results and display them in the OpenCV window.
Running the Real-time Document Detection Demo on macOS
Source Code
https://github.com/yushulx/python-document-scanner-sdk/tree/main/examples/official