Size Optimization of Scanned Documents

Oct 30, 2023

Scanning documents into digital form helps companies save physical space, speed up data retrieval, reduce costs, and collaborate more easily.

Sometimes, we come across scanned files that are very large or that have poor visual quality. In this article, we look at the factors that affect the size of scanned documents and how to optimize them.

Samples using Dynamic Web TWAIN, a JavaScript library that enables document scanning from browsers, are provided for demonstration.

Getting Started With Dynamic Web TWAIN

How Image Sizes are Calculated

Before we get started, let’s learn about how image sizes are calculated.

image bytes = width * height * bit depth / 8

In other words, the uncompressed size of an image depends on the number of pixels and on how many colors each pixel can represent.
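
As a quick check of the formula, the sketch below computes the uncompressed size in JavaScript. The function names rawImageBytes and paddedImageBytes are illustrative and are not part of Dynamic Web TWAIN. The second function assumes that each row of pixels is padded to a multiple of 4 bytes, as the BMP format does; that is an assumption about how uncompressed images are stored, not something the formula itself states.

```javascript
// Uncompressed size from the formula: width * height * bit depth / 8.
function rawImageBytes(width, height, bitDepth) {
  return (width * height * bitDepth) / 8;
}

// Variant that assumes BMP-style storage, where every row of pixels
// is padded to a multiple of 4 bytes.
function paddedImageBytes(width, height, bitDepth) {
  const rowBytes = Math.ceil((width * bitDepth) / 8);
  const paddedRowBytes = Math.ceil(rowBytes / 4) * 4;
  return paddedRowBytes * height;
}

// An 850x1100 image in 24-bit color.
console.log(rawImageBytes(850, 1100, 24));    // prints 2805000
console.log(paddedImageBytes(850, 1100, 24)); // prints 2807200
```

Dividing the byte count by 1024 gives the size in KB.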

DPI

DPI stands for dots per inch and is a measure of the resolution of a scanned document or photo. The higher the DPI, the higher the quality and resolution of your scan and the resulting image.

Generally, scanning a document at about 300 DPI produces an image with a reasonable size and good quality.

In Dynamic Web TWAIN, we can specify the DPI in the device configuration:

DWObject.AcquireImageAsync({IfShowUI:false,Resolution:300});

Here are the test results of scanning a document at different DPIs.

DPI	Resolution	Size
100	850x1100	2741.41KB
200	1700x2200	10957.03KB
300	2550x3300	24659.77KB
600	5100x6600	98613.28KB
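
The pixel dimensions in the table follow from the paper size: each side in inches multiplied by the DPI. The sketch below assumes a US Letter page (8.5 x 11 inches), which matches the dimensions in the table. The pixelDimensions helper is illustrative and is not a Dynamic Web TWAIN API.

```javascript
// Pixel dimensions of a page scanned at a given DPI.
function pixelDimensions(widthInches, heightInches, dpi) {
  return {
    width: Math.round(widthInches * dpi),
    height: Math.round(heightInches * dpi),
  };
}

// Assumed page size: US Letter, 8.5 x 11 inches.
for (const dpi of [100, 200, 300, 600]) {
  const { width, height } = pixelDimensions(8.5, 11, dpi);
  console.log(`${dpi} DPI: ${width}x${height}, ${width * height} pixels`);
}
```

Because both dimensions scale with the DPI, doubling the DPI quadruples the pixel count, which is why the 200 DPI scan is about four times the size of the 100 DPI scan.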

Bit Depth

Bit depth, also known as color depth, is the number of bits used to indicate the color of a single pixel. The greater the bit depth, the more colors a pixel can represent.

Most document scanners provide the option to scan documents in different color modes: black & white, gray and color.

Their bit depths are as follows:

black & white: 1 bit.
gray: 8 bits.
color: 24 bits.
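
The number of distinct values a pixel can take is 2 raised to the power of the bit depth. A minimal sketch of that relationship (the colorCount helper is illustrative):

```javascript
// Number of distinct colors (or gray levels) that a pixel
// with the given bit depth can represent.
function colorCount(bitDepth) {
  return 2 ** bitDepth;
}

console.log(colorCount(1));  // prints 2 (black or white)
console.log(colorCount(8));  // prints 256 (shades of gray)
console.log(colorCount(24)); // prints 16777216 (colors)
```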

In Dynamic Web TWAIN, we can specify the color mode (or pixel type) in the device configuration:

let pixelType = Dynamsoft.DWT.EnumDWT_PixelType.TWPT_BW;
DWObject.AcquireImageAsync({IfShowUI:false,PixelType:pixelType});

We can also change the bit depth of an image that has already been scanned with the following code:

let imageIndex = 0;
let bitDepth = 4;
let highQuality = false;
DWObject.ChangeBitDepth(imageIndex,bitDepth,highQuality);

Here are the test results of scanning a document in different color modes.

Pixel Type	Resolution	Bit Depth	Size
B&W	2550x3507	1	1095.94KB
Gray	2550x3507	8	8740.10KB
Color	2550x3507	24	26206.61KB

Image Compression

We can reduce the file size further by saving the image in a format that compresses the data, such as JPEG or PNG.

In Dynamic Web TWAIN, we can get the image sizes with the following code:

let imageIndex = 0;
let width = DWObject.GetImageWidth(imageIndex);
let height = DWObject.GetImageHeight(imageIndex);
let originalSize = DWObject.GetImageSize(imageIndex,width,height);
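let imageType = Dynamsoft.DWT.EnumDWT_ImageType.IT_JPG; //target format, e.g. IT_BMP, IT_JPG, IT_TIF or IT_PNG
let size = DWObject.GetImageSizeWithSpecifiedType(imageIndex,imageType); //size after compression with the format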

Here are the test results of scanning a document in different color modes and saving it in different image formats.

Pixel Type	Resolution	Bit Depth	Size	Format	Compression Rate
B&W	2550x3507	1	1096.00KB	BMP	0%
B&W	2550x3507	1	705.89KB	JPG	35.59%
B&W	2550x3507	1	38.09KB	TIF	96.52%
B&W	2550x3507	1	77.45KB	PNG	92.93%
Gray	2550x3507	8	8741.15KB	BMP	0%
Gray	2550x3507	8	590.68KB	JPG	93.24%
Gray	2550x3507	8	1957.32KB	TIF	77.61%
Gray	2550x3507	8	1574.74KB	PNG	81.98%
Color	2550x3507	24	26206.66KB	BMP	0%
Color	2550x3507	24	665.43KB	JPG	97.46%
Color	2550x3507	24	5112.49KB	TIF	80.49%
Color	2550x3507	24	3753.82KB	PNG	85.68%

We can draw some conclusions from the table:

BMP is a lossless format that stores the image without compression.
JPEG does not work well for black & white images, but it has the best compression rate for gray and color images.
TIFF has the best compression rate for black & white images.
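
The compression rate in the table is the share of the uncompressed size that a format saves. A small sketch of that calculation, using two rows from the table (the compressionRate helper is illustrative):

```javascript
// Percentage of the original size saved by compression.
function compressionRate(originalSize, compressedSize) {
  return (1 - compressedSize / originalSize) * 100;
}

// Sizes in KB taken from the table above.
console.log(compressionRate(1096.0, 38.09).toFixed(2) + "%");    // B&W saved as TIF: prints 96.52%
console.log(compressionRate(26206.66, 665.43).toFixed(2) + "%"); // Color saved as JPG: prints 97.46%
```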

Multiple Page Optimization

Most of the time, we need to scan documents with multiple pages. We can use TIFF or PDF as the container to save the images in one file. Both TIFF and PDF support many image formats and compression methods.

Since different compression methods work differently for different color modes, Dynamic Web TWAIN uses the following compression strategies to achieve an optimum result.

For TIFF, it uses the following strategy:

For 1-bit images, use TIFF_T6.
For other images, use TIFF_LZW.

For PDF, it uses the following strategy:

For 1-bit images, if the PDF version is over 1.4, use JBIG2 encoding; otherwise, use FAX4 (CCITT Group 4 Fax).
For 8-bit images, if the image is grayscale, use JPEG encoding; otherwise, use LZW (Lempel-Ziv-Welch).
For 24-bit and 32-bit images, use JPEG encoding.
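
The two strategies above can be written as simple decision functions. This is only a sketch of the rules as described; the library applies them internally, and the function names and returned strings here are plain labels for illustration, not the library's API or enum values. The comparison pdfVersion > 1.4 follows the wording "over 1.4" literally, which is an assumption about how the boundary is treated.

```javascript
// Pick a TIFF compression method following the strategy described above.
function chooseTiffCompression(bitDepth) {
  return bitDepth === 1 ? "TIFF_T6" : "TIFF_LZW";
}

// Pick a PDF encoding following the strategy described above.
// pdfVersion is a number such as 1.4 or 1.5.
// isGrayscale only matters for 8-bit images.
function choosePdfCompression(bitDepth, pdfVersion, isGrayscale) {
  if (bitDepth === 1) {
    // Assumption: "over 1.4" is read as strictly greater than 1.4.
    return pdfVersion > 1.4 ? "JBIG2" : "FAX4";
  }
  if (bitDepth === 8) {
    return isGrayscale ? "JPEG" : "LZW";
  }
  if (bitDepth === 24 || bitDepth === 32) {
    return "JPEG";
  }
  // The strategy above does not cover other bit depths.
  return undefined;
}

console.log(chooseTiffCompression(1));          // prints TIFF_T6
console.log(choosePdfCompression(1, 1.5, false)); // prints JBIG2
console.log(choosePdfCompression(24, 1.5, false)); // prints JPEG
```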

Here are the test results of saving scanned documents in TIFF and PDF formats. The test document was scanned three times: in black & white, gray and color.

Format	Original Size	Size	Compression Rate
TIFF	36042.64KB	7109.21KB	80.28%
PDF	36042.64KB	1283.96KB	96.44%

Source Code

You can find all the code and online demos in the following repo:

https://github.com/tony-xlh/scan-optimization/
