MRC Compression for PDF Files

Reduce the Size of Color Images and Image-Only PDFs by up to 90%

Image segmentation partitions the pixels of an image into classes, where the pixels in each class share coherent characteristics such as color, texture, or intensity. In document digitization, image segmentation is commonly used to partition a page image into text blocks, line-art graphics, images, and the background (texture, shadows from poor lighting, etc.).
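To make the idea concrete, here is a minimal sketch of segmentation by intensity alone. The function name, the two thresholds, and the class names are assumptions chosen for illustration; real document segmenters also look at color, texture, and the shape of connected regions.

```python
def segment_by_intensity(gray, dark_max=85, light_min=170):
    """Label each pixel of a grayscale image (values 0-255) with a class.

    The thresholds are hypothetical: dark pixels are treated as text,
    light pixels as background, and everything in between as image content.
    """
    labels = []
    for row in gray:
        label_row = []
        for value in row:
            if value <= dark_max:
                label_row.append("text")
            elif value >= light_min:
                label_row.append("background")
            else:
                label_row.append("image")
        labels.append(label_row)
    return labels


# A tiny 2x4 "page": mostly light paper, one dark stroke, two mid-tone pixels.
page = [
    [250, 250, 20, 250],
    [250, 120, 130, 250],
]
labels = segment_by_intensity(page)
for label_row in labels:
    print(label_row)
```

Each pixel ends up in exactly one class, which is the property the layered representation described later relies on.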

MRC PDF Compression

What is MRC?

MRC stands for mixed raster content, an application of image segmentation. It is a method for compressing images that contain both binary-compressible text and continuous-tone components. [1]

A document image can be represented in different layers:

  • The first layer is the foreground layer, which stores the colors of the text blocks and line-art graphics.
  • The second layer is the mask layer: a binary image of the shapes of the text blocks and line-art graphics, without the color info.
  • The third layer is the background layer, which holds the background and the images and typically accounts for most of the image data.
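The three layers can be sketched in a few lines of code. This is only an illustration under stated assumptions: the function names, the luminance threshold of 128, and the choice to fill background holes with white are all hypothetical, and a real MRC encoder would use a far more careful segmentation and hole-filling step.

```python
WHITE = (255, 255, 255)


def luminance(pixel):
    """Approximate perceived brightness of an (R, G, B) pixel, 0-255."""
    r, g, b = pixel
    return (299 * r + 587 * g + 114 * b) // 1000


def decompose(image, threshold=128):
    """Split an RGB image into mask, foreground, and background layers.

    Assumption for this sketch: any pixel darker than `threshold`
    belongs to text or line art.
    """
    mask, foreground, background = [], [], []
    for row in image:
        mask_row = [1 if luminance(p) < threshold else 0 for p in row]
        mask.append(mask_row)
        # Foreground keeps colors only where the mask is set (None = unused).
        foreground.append([p if m else None for p, m in zip(row, mask_row)])
        # Background is filled with white where text was removed.
        background.append([WHITE if m else p for p, m in zip(row, mask_row)])
    return mask, foreground, background


def recompose(mask, foreground, background):
    """Rebuild the page: the mask selects foreground or background per pixel."""
    return [
        [f if m else b for m, f, b in zip(mask_row, fg_row, bg_row)]
        for mask_row, fg_row, bg_row in zip(mask, foreground, background)
    ]


image = [
    [(250, 250, 240), (200, 0, 0), (250, 250, 240)],
    [(0, 0, 120), (250, 250, 240), (180, 200, 220)],
]
mask, foreground, background = decompose(image)
restored = recompose(mask, foreground, background)
print(mask)
```

In this toy example the recomposed page equals the original because no layer has been lossily compressed yet. In practice each layer is compressed separately, so the rendered result is a close approximation rather than an exact copy.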

Because the layers differ in how well they compress, a different compression algorithm can be applied to each one, giving an overall compression ratio of up to 10:1 without noticeably degrading the visual quality.

  • JBIG2 is an efficient algorithm for compressing the bi-level (black-and-white) text blocks and line-art graphics.
  • For the background and the colored images, JPEG and JPEG 2000 can achieve a fair level of compression without losing the smoothness and accuracy of colors.
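A compression ratio and a percentage size reduction are two views of the same number. The short sketch below shows the arithmetic; the function names and the 5 MB input size are made up for illustration and are not measurements.

```python
def size_reduction(compression_ratio):
    """Fraction of the original size that is saved at a given ratio."""
    if compression_ratio <= 0:
        raise ValueError("compression ratio must be positive")
    return 1 - 1 / compression_ratio


def compressed_size(original_bytes, compression_ratio):
    """Size in bytes after compression at a given ratio."""
    if compression_ratio <= 0:
        raise ValueError("compression ratio must be positive")
    return original_bytes / compression_ratio


original = 5_000_000  # hypothetical 5 MB scanned color page
print(f"{size_reduction(10):.0%}")         # 90%
print(int(compressed_size(original, 10)))  # 500000
```

A ratio of 10 therefore corresponds to a 90% reduction, and halving the ratio to 5 still saves 80%.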

The three layers, along with the instructions on how to reassemble and render them, are stored within a single file.

File Formats that Support MRC Compression

PDF is the most common file format that supports MRC compression.

The MRC 3-layer model has also been implemented in other digital document formats, including .tfx (TIFF-FX), .ldx (LuraDocument), and .djvu (DjVu). [2]

Benefits of MRC Compression

MRC compression was initially developed to compress scanned color pages and images for fax transmission. [2] Nowadays, the method is also used on document scans and camera snapshots.

The major benefit of Mixed Raster Content compression is a smaller file size. Smaller files use less bandwidth, so they transfer faster, and they take up less storage space.

The MRC method comes with some beneficial side effects too:

  • Sharper text. Because the text shapes are separated from the foreground colors and the background, the text can be sharpened to make it easier to read.
  • Cleaned-up background. The 3-layer segmentation also makes it easier to clean up the background. Texture and shadows in the background can distract readers, so removing the shadows or softening the texture can improve the reading experience.

MRC Improves the Accuracy of OCR

As discussed above, Mixed Raster Content compression produces clean, isolated text blocks, which improves the accuracy of downstream OCR. Because OCR turns an image-only PDF into a searchable PDF, it further improves employee efficiency.

Support for MRC

Dynamic Web TWAIN will introduce the MRC compression feature soon. Stay tuned.

References

[1] https://en.wikipedia.org/wiki/Mixed_raster_content
[2] http://www.djvu-soft.narod.ru/planetdjvu/the_mrc_mixed_raster_content_model_and_djvu.htm