Convert Scanned Documents to Searchable PDF in Java

Dec 04, 2023

In the previous article, we built a JavaFX demo app to scan documents using Dynamsoft Service’s REST API. The demo app can scan documents via protocols like TWAIN, WIA, SANE, and ICA and save the documents into a PDF file using PDFBox.

Screenshot

In this article, we are going to extend its features to convert the scanned documents into a searchable PDF.

When opening a searchable PDF, we can select text and search keywords directly. Generating a searchable PDF is very useful in a document indexing or management system.

If the PDF is generated with tools like InDesign and Word, its text is already searchable. But if the PDF contains scanned images, we have to add an extra text overlay to make it searchable.

OCR of Scanned Documents

There are various OCR engines or API services we can use. Here, we use OCRSpace’s free OCR API.

First, define the classes that hold the OCR result.

TextLine.java:

public class TextLine {
    public double left;
    public double top;
    public double width;
    public double height;
    public String text;

    public TextLine(double left, double top, double width, double height, String text) {
        this.left = left;
        this.top = top;
        this.width = width;
        this.height = height;
        this.text = text;
    }
}

OCRResult.java:

import java.util.ArrayList;

public class OCRResult {
    public ArrayList<TextLine> lines = new ArrayList<TextLine>();
}

Create an OCRSpace class which provides a static method to get the OCR result from a base64-encoded image.

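import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;
import okhttp3.FormBody;
import okhttp3.OkHttpClient;
import okhttp3.Request;
import okhttp3.RequestBody;
import okhttp3.Response;

import java.io.IOException;
import java.util.List;
import java.util.Map;
import java.util.concurrent.TimeUnit;

public class OCRSpace {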
    public static String key = "";
    public static String lang = "eng";
    /**
     * Get OCR result from a base64-encoded image in JPEG format
     *
     * @param base64 - base64-encoded image
     *
     */
    public static OCRResult detect(String base64) throws IOException {
        OCRResult result = new OCRResult();
        OkHttpClient client = new OkHttpClient.Builder()
                .connectTimeout(120, TimeUnit.SECONDS)
                .build();
        RequestBody requestBody=new FormBody.Builder()
                .add("apikey",key)
                .add("language",lang)
                .add("base64Image","data:image/jpeg;base64,"+base64.trim())
                .add("isOverlayRequired","true")
                .build();

        Request httpRequest = new Request.Builder()
                .url("https://api.ocr.space/parse/image")
                .post(requestBody)
                .build();
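        try (Response response = client.newCall(httpRequest).execute()) {
            if (!response.isSuccessful() || response.body() == null) {
                throw new IOException("OCR request failed: HTTP " + response.code());
            }
            // body().string() reads the whole response and can only be called once
            String json = response.body().string();
            parse(json,result);
        }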
        return result;
    }

    @SuppressWarnings("unchecked")
    private static void parse(String json,OCRResult ocrResult) throws JsonProcessingException {
        ObjectMapper objectMapper = new ObjectMapper();
        Map<String,Object> body = objectMapper.readValue(json,new TypeReference<Map<String,Object>>() {});
        List<Map<String,Object>> parsedResults = (List<Map<String, Object>>) body.get("ParsedResults");
        if (parsedResults == null) {
            // The response has no results (for example, the request was rejected); leave the result empty
            return;
        }
        for (Map<String,Object> parsedResult:parsedResults) {
            Map<String,Object> textOverlay = (Map<String, Object>) parsedResult.get("TextOverlay");
            if (textOverlay == null || textOverlay.get("Lines") == null) {
                continue;
            }
            List<Map<String,Object>> lines = (List<Map<String, Object>>) textOverlay.get("Lines");
            for (Map<String,Object> line:lines) {
                TextLine textLine = parseAsTextLine(line);
                ocrResult.lines.add(textLine);
            }
        }
    }

    @SuppressWarnings("unchecked")
    private static TextLine parseAsTextLine(Map<String,Object> line){
        String lineText = (String) line.get("LineText");
        List<Map<String,Object>> words = (List<Map<String, Object>>) line.get("Words");
        // Start from the first word's position, then grow the box to cover every word in the line
        int minX = toInt(words.get(0).get("Left"));
        int minY = toInt(words.get(0).get("Top"));
        int maxX = 0;
        int maxY = 0;
        for (Map<String,Object> word:words) {
            int x = toInt(word.get("Left"));
            int y = toInt(word.get("Top"));
            int width = toInt(word.get("Width"));
            int height = toInt(word.get("Height"));
            minX = Math.min(minX,x);
            minY = Math.min(minY,y);
            maxX = Math.max(maxX,x+width);
            maxY = Math.max(maxY,y+height);
        }
        return new TextLine(minX,minY,maxX - minX,maxY-minY,lineText);
    }

    // Jackson may deserialize a JSON number as Integer or Double, so convert through Number
    private static int toInt(Object value){
        return (int) ((Number) value).doubleValue();
    }
}

Here, we use OkHttp for HTTP requests and Jackson as the JSON library.
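
The bounding-box logic in parseAsTextLine can be tried without calling the OCR service. The following is a minimal sketch that takes word boxes as plain maps, in the same shape as the "Words" entries parsed above, and returns the box that encloses all of them. The class name LineBoxSketch and the sample values are made up for illustration; they are not part of the article's code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration: merges word boxes into one line box.
final class LineBoxSketch {
    /** Returns {left, top, width, height} enclosing every word box. Assumes at least one word. */
    static double[] union(List<Map<String, Object>> words) {
        double minX = Double.POSITIVE_INFINITY, minY = Double.POSITIVE_INFINITY;
        double maxX = Double.NEGATIVE_INFINITY, maxY = Double.NEGATIVE_INFINITY;
        for (Map<String, Object> word : words) {
            // Going through Number accepts both Integer and Double values.
            double x = ((Number) word.get("Left")).doubleValue();
            double y = ((Number) word.get("Top")).doubleValue();
            double w = ((Number) word.get("Width")).doubleValue();
            double h = ((Number) word.get("Height")).doubleValue();
            minX = Math.min(minX, x);
            minY = Math.min(minY, y);
            maxX = Math.max(maxX, x + w);
            maxY = Math.max(maxY, y + h);
        }
        return new double[] {minX, minY, maxX - minX, maxY - minY};
    }

    public static void main(String[] args) {
        // Two made-up word boxes on the same line.
        List<Map<String, Object>> words = List.of(
                Map.<String, Object>of("Left", 10, "Top", 20.0, "Width", 30, "Height", 12),
                Map.<String, Object>of("Left", 45.0, "Top", 18, "Width", 25, "Height", 15));
        System.out.println(Arrays.toString(union(words))); // prints [10.0, 18.0, 60.0, 15.0]
    }
}
```

The line box spans from the leftmost word's left edge to the rightmost word's right edge, and from the highest top to the lowest bottom.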

Add Text Overlay to a PDF Page

Create a SearchablePDFCreator class for related methods.
```
public class SearchablePDFCreator {}
```

Add an addTextOverlay method to add text overlay to an existing PDF page.

/**
 * Add text overlay to an existing PDF page
 * @param contentStream - PDF content stream
 * @param result - OCR result
 * @param pageHeight - Height of the image in pixels
 * @param pdFont - Font used to measure the text and size the overlay
 * @param percent - Scale factor from image pixels to page units (page's height / image's height)
 */
public static void addTextOverlay(PDPageContentStream contentStream,OCRResult result, double pageHeight, PDFont pdFont,double percent) throws IOException {
    PDFont font = pdFont;
    contentStream.setFont(font, 16);
    contentStream.setRenderingMode(RenderingMode.NEITHER);
    for (int i = 0; i <result.lines.size() ; i++) {
        TextLine line = result.lines.get(i);
        FontInfo fi = calculateFontSize(font,line.text, (float) (line.width * percent), (float) (line.height * percent));
        contentStream.beginText();
        contentStream.setFont(font, fi.fontSize);
        contentStream.newLineAtOffset((float) (line.left * percent), (float) ((pageHeight - line.top - line.height) * percent));
        contentStream.showText(line.text);
        contentStream.endText();
    }
}

private static FontInfo calculateFontSize(PDFont font, String text, float bbWidth, float bbHeight) throws IOException {
    int fontSize = 17;
    float textWidth = font.getStringWidth(text) / 1000 * fontSize;
    float textHeight = font.getFontDescriptor().getFontBoundingBox().getHeight() / 1000 * fontSize;

    if(textWidth > bbWidth){
        // Shrink until the text fits; stop at 1 so the size never reaches zero
        while(textWidth > bbWidth && fontSize > 1){
            fontSize -= 1;
            textWidth = font.getStringWidth(text) / 1000 * fontSize;
            textHeight = font.getFontDescriptor().getFontBoundingBox().getHeight() / 1000 * fontSize;
        }
    }
    else if(textWidth > 0 && textWidth < bbWidth){
        // Grow until the text spans the box; the textWidth > 0 check avoids an endless loop on empty text
        while(textWidth < bbWidth){
            fontSize += 1;
            textWidth = font.getStringWidth(text) / 1000 * fontSize;
            textHeight = font.getFontDescriptor().getFontBoundingBox().getHeight() / 1000 * fontSize;
        }
    }

    FontInfo fi = new FontInfo();
    fi.fontSize = fontSize;
    fi.textHeight = textHeight;
    fi.textWidth = textWidth;

    return fi;
}

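The font size is calculated from the specified font and the line’s width: starting from 17, it is decreased or increased one step at a time until the rendered text is about as wide as the OCR bounding box. FontInfo is a small holder class with fontSize, textWidth and textHeight fields.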
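
Because the rendered width grows linearly with the font size, the same fit can also be computed in one step instead of a loop. The sketch below is an alternative to calculateFontSize, not the method used above. It assumes the caller passes in the value of font.getStringWidth(text) and the font bounding box height, both in 1/1000 text units, so that it runs without PDFBox. The FontInfo class here is a stand-in whose fields are inferred from the listing, and FontFitSketch is a made-up name.

```java
// Stand-in for the FontInfo holder; the fields are inferred from how the listing uses it.
final class FontInfo {
    int fontSize;
    float textWidth;
    float textHeight;
}

// Hypothetical closed-form alternative to the step-by-step search.
final class FontFitSketch {
    /**
     * @param stringWidthUnits   width of the text in 1/1000 text units (what getStringWidth returns)
     * @param fontBoxHeightUnits height of the font bounding box in 1/1000 text units
     * @param bbWidth            width of the OCR bounding box in page units
     */
    static FontInfo fit(float stringWidthUnits, float fontBoxHeightUnits, float bbWidth) {
        FontInfo fi = new FontInfo();
        if (stringWidthUnits <= 0 || bbWidth <= 0) {
            fi.fontSize = 1; // nothing to measure, fall back to the smallest size
        } else {
            // Largest whole size whose rendered width does not exceed the box
            fi.fontSize = Math.max(1, (int) Math.floor(bbWidth * 1000f / stringWidthUnits));
        }
        fi.textWidth = stringWidthUnits / 1000f * fi.fontSize;
        fi.textHeight = fontBoxHeightUnits / 1000f * fi.fontSize;
        return fi;
    }

    public static void main(String[] args) {
        // A text that is 5 em wide placed in a 120-unit-wide box
        FontInfo fi = fit(5000f, 1100f, 120f);
        System.out.println(fi.fontSize); // prints 24
    }
}
```

Unlike the loop, which stops growing at the first size where the text is at least as wide as the box, this version always rounds down, so the text never overshoots the box.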

Add an addPage method to add text overlay along with the image as a new page to a document.

public static void addPage(byte[] imageBytes,OCRResult result, PDDocument document,int pageIndex,PDFont pdFont) throws IOException {
    ByteArrayInputStream in = new ByteArrayInputStream(imageBytes);
    BufferedImage bi = ImageIO.read(in);
    // Create a new PDF page
    PDRectangle rect = new PDRectangle((float) bi.getWidth(),(float) bi.getHeight());
    PDPage page = new PDPage(rect);
    document.addPage(page);
    PDPageContentStream contentStream = new PDPageContentStream(document, page);
    PDImageXObject image
            = PDImageXObject.createFromByteArray(document,imageBytes,String.valueOf(pageIndex));
    contentStream.drawImage(image, 0, 0);
    // The page has the same size as the image, so no scaling is needed (percent = 1)
    addTextOverlay(contentStream,result,bi.getHeight(),pdFont,1.0);
    contentStream.close();
}

Let’s examine the result.

Using RenderingMode.NEITHER makes the text layer invisible. We can comment out that line to see the overlaid text. The following is a region of a PDF file with the text overlay. We can see that the text fits the image closely.

text overlay
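
The overlay lines up with the image because each OCR box is converted from image coordinates, where the origin is the top-left corner and y grows downward, to PDF user space, where the origin is the bottom-left corner and y grows upward. The sketch below isolates that conversion as it appears in the newLineAtOffset call of addTextOverlay. The class and method names are made up for illustration.

```java
// Hypothetical helper: maps an OCR line box to the text origin used in the PDF.
final class OverlayCoordsSketch {
    /**
     * @param left, top, height  line box in image pixels (origin top-left)
     * @param imageHeight        height of the image in pixels
     * @param scale              page units per image pixel (1.0 when the page matches the image size)
     * @return {x, y} in PDF user space (origin bottom-left)
     */
    static float[] toPdf(double left, double top, double height, double imageHeight, double scale) {
        float x = (float) (left * scale);
        // Flip the y axis: the bottom edge of the box measured from the bottom of the page
        float y = (float) ((imageHeight - top - height) * scale);
        return new float[] {x, y};
    }

    public static void main(String[] args) {
        // A 20-pixel-high line that starts 50 pixels below the top of a 1000-pixel-high image
        float[] p = toPdf(100, 50, 20, 1000, 1.0);
        System.out.println(p[0] + ", " + p[1]); // prints 100.0, 930.0
    }
}
```

Note that the y value is the bottom edge of the OCR box, which is used as the text baseline, so descenders of the invisible text can extend slightly below the box.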

Save a Scanned Document Image into a Searchable PDF

Next, we can try to use the classes we just wrote to create a searchable PDF from an image.

File image = new File("F://WebTWAINImage.jpg");
// Read the whole file; a single InputStream.read() call may return fewer bytes than requested
byte[] byteArray = Files.readAllBytes(image.toPath());
String base64 = Base64.getEncoder().encodeToString(byteArray);
OCRSpace.key = "your key";
OCRResult result = OCRSpace.detect(base64);
// try-with-resources closes the document even if saving fails
try (PDDocument document = new PDDocument()) {
    // Any PDFont works; Times Roman is only an example.
    // This is the PDFBox 2.x constant; PDFBox 3.x uses new PDType1Font(Standard14Fonts.FontName.TIMES_ROMAN)
    SearchablePDFCreator.addPage(byteArray,result,document,0,PDType1Font.TIMES_ROMAN);
    document.save(new File("F://output.pdf"));
}
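
The example above works for JPEG files because OCRSpace.detect always prefixes the data with data:image/jpeg;base64. If scans may also arrive as PNG, the prefix could be chosen from the file's leading bytes. The following is a minimal sketch of that idea, assuming the OCR service accepts other content types in the data URI. DataUriSketch is a made-up helper and is not part of the article's code.

```java
import java.util.Base64;

// Hypothetical helper: builds a data URI whose content type matches the image bytes.
final class DataUriSketch {
    /** Detects JPEG and PNG from their signature bytes. */
    static String mimeType(byte[] bytes) {
        if (bytes.length >= 3
                && (bytes[0] & 0xFF) == 0xFF && (bytes[1] & 0xFF) == 0xD8 && (bytes[2] & 0xFF) == 0xFF) {
            return "image/jpeg";
        }
        if (bytes.length >= 4
                && (bytes[0] & 0xFF) == 0x89 && bytes[1] == 'P' && bytes[2] == 'N' && bytes[3] == 'G') {
            return "image/png";
        }
        throw new IllegalArgumentException("Unsupported image format");
    }

    static String toDataUri(byte[] bytes) {
        return "data:" + mimeType(bytes) + ";base64," + Base64.getEncoder().encodeToString(bytes);
    }

    public static void main(String[] args) {
        // The first four bytes of a typical JPEG file
        byte[] jpegHeader = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF, (byte) 0xE0};
        System.out.println(toDataUri(jpegHeader)); // prints data:image/jpeg;base64,/9j/4A==
    }
}
```

With such a helper, detect could take the raw image bytes and pass the result of toDataUri as the base64Image field.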

Enhance the JavaFX Demo with the Searchable PDF Creator

Add the searchablePDF4j library, which is served through JitPack, as a dependency by editing pom.xml.

<repositories>
    <repository>
        <id>jitpack.io</id>
        <url>https://jitpack.io</url>
    </repository>
</repositories>
<dependencies>
    <dependency>
        <groupId>com.github.tony-xlh</groupId>
        <artifactId>searchablePDF4j</artifactId>
        <version>1.0.0</version>
    </dependency>
</dependencies>

Add a checkbox to enable searchable PDF generation in the UI.

If the checkbox is selected, generate a searchable PDF by adding a text overlay.

 PDDocument document = new PDDocument();
 int index = 0;
 for (DocumentImage di: documentListView.getItems()) {
     index = index + 1;
     ImageView imageView = di.imageView;
     PDRectangle rect = new PDRectangle((float) imageView.getImage().getWidth(),(float) imageView.getImage().getHeight());
     System.out.println(rect);
     PDPage page = new PDPage(rect);
     document.addPage(page);
     PDPageContentStream contentStream = new PDPageContentStream(document, page);
     PDImageXObject image
             = PDImageXObject.createFromByteArray(document,di.image,String.valueOf(index));
     contentStream.drawImage(image, 0, 0);
+    if (searchablePDFCheckBox.isSelected()) {
+        String base64 = Base64.getEncoder().encodeToString(di.image);
+        OCRSpace.key = "your key";
+        OCRResult result = OCRSpace.detect(base64);
+        SearchablePDFCreator.addTextOverlay(contentStream,result,image.getHeight());
+    }
     contentStream.close();
 }
 document.save(fileToSave.getAbsolutePath());
 document.close();

Source Code

Check out the code to have a try: