Size Optimization of Scanned Documents
Scanning documents into a digital form can help companies save physical space, improve data retrieval, reduce costs, improve collaboration, etc.
Sometimes, we may come across very large scanned document files or files scanned with a low visual quality. In this article, we are going to talk about aspects related to the size optimization of scanned documents.
The demonstrations use samples built with Dynamic Web TWAIN, a JavaScript library that enables document scanning from browsers.
How Image Sizes are Calculated
Before we get started, let’s learn about how image sizes are calculated.
image bytes = width * height * bit depth / 8
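Here, width and height are measured in pixels. In other words, the uncompressed size depends on the number of pixels and on the bit depth, which determines how many colors each pixel can represent.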
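As a quick check of the formula above, here is a minimal JavaScript sketch. The function names are illustrative and not part of Dynamic Web TWAIN. The second function assumes that each pixel row is padded to a multiple of 4 bytes, as in BMP-style bitmaps; this is an assumption, but it is consistent with the uncompressed sizes reported in the tables of this article.

```javascript
// Uncompressed size in bytes, straight from the formula above.
function rawImageBytes(width, height, bitDepth) {
  return (width * height * bitDepth) / 8;
}

// Assumption: each row is padded to a multiple of 4 bytes (BMP-style).
function paddedImageBytes(width, height, bitDepth) {
  const bytesPerRow = Math.ceil((width * bitDepth) / 32) * 4;
  return bytesPerRow * height;
}

// Format a byte count as kilobytes with two decimals.
function toKB(bytes) {
  return (bytes / 1024).toFixed(2) + "KB";
}

console.log(rawImageBytes(850, 1100, 24));          // 2805000
console.log(toKB(paddedImageBytes(850, 1100, 24))); // 2741.41KB
```

The padded variant is only slightly larger than the plain formula, so the formula remains a good estimate.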
DPI
DPI stands for dots per inch and is a measure of the resolution of a scanned document or photo. The higher the DPI, the more detail the scan captures, but also the larger the image: doubling the DPI quadruples the number of pixels.
Generally, scanning a document at about 300 DPI produces an image with a reasonable size and good quality.
In Dynamic Web TWAIN, we can specify the DPI in the device configuration:
DWObject.AcquireImageAsync({IfShowUI:false,Resolution:300});
Here are the test results of scanning the same document at different DPIs. The sizes are uncompressed and correspond to 24-bit color.
DPI | Resolution | Size |
---|---|---|
100 | 850x1100 | 2741.41KB |
200 | 1700x2200 | 10957.03KB |
300 | 2550x3300 | 24659.77KB |
600 | 5100x6600 | 98613.28KB |
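The relation between DPI, pixel dimensions and size can be sketched as follows. The page size of 8.5 x 11 inches (US Letter) is an assumption inferred from the pixel dimensions in the table, and `pixelsAtDpi` is an illustrative helper, not a Dynamic Web TWAIN function.

```javascript
// Pixel dimensions of a page scanned at a given DPI.
function pixelsAtDpi(widthInches, heightInches, dpi) {
  return {
    width: Math.round(widthInches * dpi),
    height: Math.round(heightInches * dpi),
  };
}

// Assumption: US Letter page (8.5 x 11 inches) scanned in 24-bit color.
for (const dpi of [100, 200, 300, 600]) {
  const { width, height } = pixelsAtDpi(8.5, 11, dpi);
  const bytes = (width * height * 24) / 8; // width * height * bit depth / 8
  console.log(`${dpi} DPI: ${width}x${height}, ${(bytes / 1024).toFixed(2)}KB`);
}
```

The printed sizes are close to the table but not identical, since the plain formula ignores details such as row padding.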
Bit Depth
Bit depth, also known as color depth, is the number of bits used to indicate the color of a single pixel. The bigger the bit depth, the more colors a pixel can represent.
Most document scanners provide the option to scan documents in different color modes: black & white, gray and color.
Their bit depths are as follows:
- black & white: 1 bit.
- gray: 8 bits.
- color: 24 bits.
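To make the link between bit depth and colors concrete: a pixel stored in n bits can take 2^n distinct values. The helper below is only an illustration and is not part of any library.

```javascript
// Number of distinct values a pixel can take at a given bit depth.
function colorCount(bitDepth) {
  return 2 ** bitDepth;
}

console.log(colorCount(1));  // 2 (black or white)
console.log(colorCount(8));  // 256 (shades of gray)
console.log(colorCount(24)); // 16777216 (colors)
```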
In Dynamic Web TWAIN, we can specify the color mode (or pixel type) in the device configuration:
let pixelType = Dynamsoft.DWT.EnumDWT_PixelType.TWPT_BW;
DWObject.AcquireImageAsync({IfShowUI:false,PixelType:pixelType});
We can also change the bit depth of an image that has already been scanned into the buffer:
let imageIndex = 0;
let bitDepth = 4;
let highQuality = false;
DWObject.ChangeBitDepth(imageIndex,bitDepth,highQuality);
Here are the test results of scanning a document in different color modes.
Pixel Type | Resolution | Bit Depth | Size |
---|---|---|---|
B&W | 2550x3507 | 1 | 1095.94KB |
Gray | 2550x3507 | 8 | 8740.10KB |
Color | 2550x3507 | 24 | 26206.61KB |
Image Compression
We can compress the image data with different image file formats like JPEG and PNG.
In Dynamic Web TWAIN, we can get the image sizes with the following code:
let imageIndex = 0;
let width = DWObject.GetImageWidth(imageIndex);
let height = DWObject.GetImageHeight(imageIndex);
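let originalSize = DWObject.GetImageSize(imageIndex,width,height); // uncompressed size in bytes
let imageType = Dynamsoft.DWT.EnumDWT_ImageType.IT_JPG; // target format; other values include IT_BMP, IT_TIF and IT_PNG
let size = DWObject.GetImageSizeWithSpecifiedType(imageIndex,imageType); // size after compression with that format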
Here are the test results of scanning a document in different color modes and saving it in different image formats.
Pixel Type | Resolution | Bit Depth | Size | Format | Compression Rate |
---|---|---|---|---|---|
B&W | 2550x3507 | 1 | 1096.00KB | BMP | 0% |
B&W | 2550x3507 | 1 | 705.89KB | JPG | 35.59% |
B&W | 2550x3507 | 1 | 38.09KB | TIF | 96.52% |
B&W | 2550x3507 | 1 | 77.45KB | PNG | 92.93% |
Gray | 2550x3507 | 8 | 8741.15KB | BMP | 0% |
Gray | 2550x3507 | 8 | 590.68KB | JPG | 93.24% |
Gray | 2550x3507 | 8 | 1957.32KB | TIF | 77.61% |
Gray | 2550x3507 | 8 | 1574.74KB | PNG | 81.98% |
Color | 2550x3507 | 24 | 26206.66KB | BMP | 0% |
Color | 2550x3507 | 24 | 665.43KB | JPG | 97.46% |
Color | 2550x3507 | 24 | 5112.49KB | TIF | 80.49% |
Color | 2550x3507 | 24 | 3753.82KB | PNG | 85.68% |
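We can draw some conclusions from the table:
- BMP is a lossless format which does not compress the image, so it serves as the baseline.
- TIFF and PNG have the best compression rates for black & white images (96.52% and 92.93%).
- JPEG, a lossy format, does not work well for black & white images, but it has the best compression rate for gray and color images.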
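The article does not define the compression rate explicitly, but the figures in the table are consistent with the share of the uncompressed size that is saved. Under that assumption, it can be computed as in this sketch, which uses the gray rows of the table as input (`compressionRate` is an illustrative name):

```javascript
// Compression rate as a percentage: how much smaller the encoded file is
// than the uncompressed image (assumed definition, inferred from the table).
function compressionRate(originalSize, compressedSize) {
  return (1 - compressedSize / originalSize) * 100;
}

// Sizes in KB taken from the gray rows of the table above.
const original = 8741.15;
const results = { JPG: 590.68, TIF: 1957.32, PNG: 1574.74 };
for (const [format, size] of Object.entries(results)) {
  console.log(`${format}: ${compressionRate(original, size).toFixed(2)}%`);
}
```

With Dynamic Web TWAIN, the two inputs would come from `GetImageSize` and `GetImageSizeWithSpecifiedType`.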
Multiple Page Optimization
Most of the time, we need to scan documents with multiple pages. We can use TIFF or PDF as the container to save the images in one file. Both TIFF and PDF support many image formats and compression methods.
Since different compression methods work differently for different color modes, Dynamic Web TWAIN uses the following compression strategies to achieve an optimum result.
For TIFF, it uses the following strategy:
- For 1-bit images, use TIFF_T6.
- For other images, use TIFF_LZW.
For PDF, it uses the following strategy:
- For 1-bit images, if the PDF version is higher than 1.4, use JBIG2 encoding; otherwise, use FAX4 (CCITT Group 4 Fax).
- For 8-bit images, if the image is grayscale, use JPEG encoding; otherwise, use LZW (Lempel-Ziv-Welch).
- For 24-bit and 32-bit images, use JPEG encoding.
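The rules above can be written down as a small decision function. This is only a sketch of the strategy as described in this article, not Dynamic Web TWAIN's actual code; the function names, parameters and returned labels are illustrative.

```javascript
// Compression chosen for a page saved to TIFF, following the rules above.
function tiffCompression(bitDepth) {
  return bitDepth === 1 ? "TIFF_T6" : "TIFF_LZW";
}

// Compression chosen for a page saved to PDF, following the rules above.
function pdfCompression(bitDepth, isGrayscale, pdfVersion) {
  if (bitDepth === 1) {
    // The version condition is read here as strictly greater than 1.4.
    return pdfVersion > 1.4 ? "JBIG2" : "FAX4";
  }
  if (bitDepth === 8) {
    return isGrayscale ? "JPEG" : "LZW";
  }
  if (bitDepth === 24 || bitDepth === 32) {
    return "JPEG";
  }
  // Other bit depths are not covered by the rules above.
  return undefined;
}

console.log(tiffCompression(1));             // TIFF_T6
console.log(pdfCompression(1, false, 1.5));  // JBIG2
console.log(pdfCompression(8, true, 1.5));   // JPEG
console.log(pdfCompression(24, false, 1.4)); // JPEG
```

Choosing the codec per page is what lets a mixed document keep its black & white pages very small while its color pages still use JPEG.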
Here are the test results of saving scanned documents as TIFF and PDF. The same document is scanned in black & white, gray and color, and the three images are saved into a single file.
Format | Original Size | Size | Compression Rate | Link |
---|---|---|---|---|
TIFF | 36042.64KB | 7109.21KB | 80.28% | Download |
PDF | 36042.64KB | 1283.96KB | 96.44% | Download |
Source Code
You can find all the code and online demos in the following repo: