How to Convert Scanned Documents to Searchable PDF in Java with OCR

In the previous article, we built a JavaFX demo app to scan documents using Dynamic Web TWAIN Service’s REST API. The demo app can scan documents via protocols like TWAIN, WIA, SANE, and ICA and save the documents into a PDF file using PDFBox.

Screenshot

In this article, we are going to extend its features to convert the scanned documents into a searchable PDF.

When opening a searchable PDF, we can select text and search keywords directly. Generating a searchable PDF is very useful in a document indexing or management system.

If the PDF is generated with tools like InDesign and Word, its text is already searchable. But if the PDF contains scanned images, we have to add an extra text overlay to make it searchable.

What you’ll build: A Java application that uses OCR (via OCRSpace API) and PDFBox to convert scanned document images into fully searchable PDF files with an invisible text overlay.

Key Takeaways

  • Scanned PDFs become searchable by adding an invisible OCR text overlay on top of each page image using PDFBox’s RenderingMode.NEITHER.
  • The OCRSpace free API returns word-level bounding boxes, which are used to position each text line at the correct coordinates in the PDF.
  • Font size is auto-calculated to match the bounding box width of each OCR-detected line, ensuring accurate text selection.
  • This approach integrates into any Java document management pipeline — scan via TWAIN/WIA/SANE, OCR the image, and output a searchable PDF.

Common Developer Questions

  • How do I make a scanned PDF searchable in Java using OCR?
  • How do I add an invisible text layer to a PDF image page with PDFBox?
  • How do I use the OCRSpace API to get word coordinates from a scanned document?

Prerequisites

Step 1: Extract Text from Scanned Documents with OCR

There are various OCR engines or API services we can use. Here, we use OCRSpace’s free OCR API.

  1. Create related definitions.

    TextLine.class:

    public class TextLine {
        public double left;
        public double top;
        public double width;
        public double height;
        public String text;
    
        public TextLine(double left, double top, double width, double height, String text) {
            this.left = left;
            this.top = top;
            this.width = width;
            this.height = height;
            this.text = text;
        }
    }
    

    OCRResult.class:

    public class OCRResult {
        public ArrayList<TextLine> lines = new ArrayList<TextLine>();
    }
    
  2. Create an OCRSpace class which provides a static method to get the OCR result from a base64-encoded image.

    public class OCRSpace {
        public static String key = "";
        public static String lang = "eng";
        /**
         * Get OCR result from a base64-encoded image in JPEG format
         *
         * @param base64 - base64-encoded image
         *
         */
        public static OCRResult detect(String base64) throws IOException {
            OCRResult result = new OCRResult();
            OkHttpClient client = new OkHttpClient.Builder()
                    .connectTimeout(120, TimeUnit.SECONDS)
                    .build();
            RequestBody requestBody=new FormBody.Builder()
                    .add("apikey",key)
                    .add("language",lang)
                    .add("base64Image","data:image/jpeg;base64,"+base64.trim())
                    .add("isOverlayRequired","true")
                    .build();
    
            Request httpRequest = new Request.Builder()
                    .url("https://api.ocr.space/parse/image")
                    .post(requestBody)
                    .build();
            try (Response response = client.newCall(httpRequest).execute()) {
                try {
                    String json = response.body().string();
                    parse(json,result);
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
            }
            return result;
        }
    
        private static void parse(String json,OCRResult ocrResult) throws JsonProcessingException {
            ObjectMapper objectMapper = new ObjectMapper();
            Map<String,Object> body = objectMapper.readValue(json,new TypeReference<Map<String,Object>>() {});
            List<Map<String,Object>> parsedResults = (List<Map<String, Object>>) body.get("ParsedResults");
            for (Map<String,Object> parsedResult:parsedResults) {
                Map<String,Object> textOverlay = (Map<String, Object>) parsedResult.get("TextOverlay");
                List<Map<String,Object>> lines = (List<Map<String, Object>>) textOverlay.get("Lines");
                for (Map<String,Object> line:lines) {
                    TextLine textLine = parseAsTextLine(line);
                    ocrResult.lines.add(textLine);
                }
            }
        }
    
        private static TextLine parseAsTextLine(Map<String,Object> line){
            String lineText = (String) line.get("LineText");
            List<Map<String,Object>> words = (List<Map<String, Object>>) line.get("Words");
            int minX = (int)((double) words.get(0).get("Left"));
            int minY = (int)((double) words.get(0).get("Top"));
            int maxX = 0;
            int maxY = 0;
            for (Map<String,Object> word:words) {
                int x = (int)((double) word.get("Left"));
                int y = (int)((double) word.get("Top"));
                int width = (int)((double) word.get("Width"));
                int height = (int)((double) word.get("Height"));
                minX = Math.min(minX,x);
                minY = Math.min(minY,y);
                maxX = Math.max(maxX,x+width);
                maxY = Math.max(maxY,y+height);
            }
            return new TextLine(minX,minY,maxX - minX,maxY-minY,lineText);
        }
    }
    

    Here, we use OKHttp for HTTP requests and Jackson as the JSON library.

Step 2: Add an Invisible Text Overlay to Each PDF Page

  1. Create a SearchablePDFCreator class for related methods.

    public class SearchablePDFCreator {}
    
  2. Add an addTextOverlay method to add text overlay to an existing PDF page.

    /**
     * Add text overlay to an existing PDF page
     * @param contentStream - PDF content stream
     * @param result - OCR result
     * @param pageHeight - Height of the image
     * @param pdFont - Specify a font for evaluation of the position
     * @param percent - image's height / page's height
     */
    public static void addTextOverlay(PDPageContentStream contentStream,OCRResult result, double pageHeight, PDFont pdFont,double percent) throws IOException {
        PDFont font = pdFont;
        contentStream.setFont(font, 16);
        contentStream.setRenderingMode(RenderingMode.NEITHER);
        for (int i = 0; i <result.lines.size() ; i++) {
            TextLine line = result.lines.get(i);
            FontInfo fi = calculateFontSize(font,line.text, (float) (line.width * percent), (float) (line.height * percent));
            contentStream.beginText();
            contentStream.setFont(font, fi.fontSize);
            contentStream.newLineAtOffset((float) (line.left * percent), (float) ((pageHeight - line.top - line.height) * percent));
            contentStream.showText(line.text);
            contentStream.endText();
        }
    }
    
    private static FontInfo calculateFontSize(PDFont font, String text, float bbWidth, float bbHeight) throws IOException {
        int fontSize = 17;
        float textWidth = font.getStringWidth(text) / 1000 * fontSize;
        float textHeight = font.getFontDescriptor().getFontBoundingBox().getHeight() / 1000 * fontSize;
    
        if(textWidth > bbWidth){
            while(textWidth > bbWidth){
                fontSize -= 1;
                textWidth = font.getStringWidth(text) / 1000 * fontSize;
                textHeight = font.getFontDescriptor().getFontBoundingBox().getHeight() / 1000 * fontSize;
            }
        }
        else if(textWidth < bbWidth){
            while(textWidth < bbWidth){
                fontSize += 1;
                textWidth = font.getStringWidth(text) / 1000 * fontSize;
                textHeight = font.getFontDescriptor().getFontBoundingBox().getHeight() / 1000 * fontSize;
            }
        }
    
        FontInfo fi = new FontInfo();
        fi.fontSize = fontSize;
        fi.textHeight = textHeight;
        fi.textWidth = textWidth;
    
        return fi;
    }
    

    The font size is automatically calculated based on the font specified and the line’s width.

  3. Add an addPage method to add text overlay along with the image as a new page to a document.

    public static void addPage(byte[] imageBytes,OCRResult result, PDDocument document,int pageIndex,PDFont pdFont) throws IOException {
        ByteArrayInputStream in = new ByteArrayInputStream(imageBytes);
        BufferedImage bi = ImageIO.read(in);
        // Create a new PDF page
        PDRectangle rect = new PDRectangle((float) bi.getWidth(),(float) bi.getHeight());
        PDPage page = new PDPage(rect);
        document.addPage(page);
        PDPageContentStream contentStream = new PDPageContentStream(document, page);
        PDImageXObject image
                = PDImageXObject.createFromByteArray(document,imageBytes,String.valueOf(pageIndex));
        contentStream.drawImage(image, 0, 0);
        addTextOverlay(contentStream,result,bi.getHeight(),pdFont);
        contentStream.close();
    }
    

Let’s examine the result.

Using RenderingMode.NEITHER will make the text layer invisible. We can comment out this line to see the text overlayed. The following is a region of a PDF file with the text overlay. We can see that the text fits the image closely.

text overlay

Step 3: Save a Scanned Image as a Searchable PDF

Next, we can try to use the classes we just wrote to create a searchable PDF from an image.

File image = new File("F://WebTWAINImage.jpg");
byte[] byteArray = new byte[(int) image.length()];
try (FileInputStream inputStream = new FileInputStream(image)) {
    inputStream.read(byteArray);
} catch (IOException e) {
    throw new RuntimeException(e);
}
String base64 = Base64.getEncoder().encodeToString(byteArray);
OCRSpace.key = "your key";
OCRResult result = OCRSpace.detect(base64);
PDDocument document = new PDDocument();
SearchablePDFCreator.addPage(byteArray,result,document,0);
document.save(new File("F://output.pdf"));
document.close();

Step 4: Integrate Searchable PDF into a JavaFX Document Scanner

  1. Add the library as a dependency editing pom.xml.

    <repositories>
        <repository>
            <id>jitpack.io</id>
            <url>https://jitpack.io</url>
        </repository>
    </repositories>
    <dependencies>
        <dependency>
            <groupId>com.github.tony-xlh</groupId>
            <artifactId>searchablePDF4j</artifactId>
            <version>1.0.0</version>
        </dependency>
    </dependencies>
    
  2. Add a checkbox to enable searchable PDF generation in the UI.
  3. If the checkbox is selected, generate a searchable PDF by adding a text overlay.

     PDDocument document = new PDDocument();
     int index = 0;
     for (DocumentImage di: documentListView.getItems()) {
         index = index + 1;
         ImageView imageView = di.imageView;
         PDRectangle rect = new PDRectangle((float) imageView.getImage().getWidth(),(float) imageView.getImage().getHeight());
         System.out.println(rect);
         PDPage page = new PDPage(rect);
         document.addPage(page);
         PDPageContentStream contentStream = new PDPageContentStream(document, page);
         PDImageXObject image
                 = PDImageXObject.createFromByteArray(document,di.image,String.valueOf(index));
         contentStream.drawImage(image, 0, 0);
    +    if (searchablePDFCheckBox.isSelected()) {
    +        String base64 = Base64.getEncoder().encodeToString(di.image);
    +        OCRSpace.key = "your key";
    +        OCRResult result = OCRSpace.detect(base64);
    +        SearchablePDFCreator.addTextOverlay(contentStream,result,image.getHeight());
    +    }
         contentStream.close();
     }
     document.save(fileToSave.getAbsolutePath());
     document.close();
    

Common Issues and Edge Cases

  • OCR returns empty results for low-resolution scans. The OCRSpace API works best with images at 300 DPI or higher. If your scanned images are below 200 DPI, upscale them before sending to the API, or the text overlay will be incomplete.
  • Non-Latin characters render incorrectly in the text overlay. The default PDFBox font (Helvetica) does not support CJK or other non-Latin scripts. Load a TrueType font that covers your target language using PDType0Font.load() and pass it to addTextOverlay.
  • Text selection in the PDF is slightly misaligned. This happens when the PDF page size differs from the original image dimensions. Ensure the PDRectangle matches the scanned image width and height exactly, and verify the percent scaling factor is 1.0 when page and image sizes match.

Source Code

Check out the code to have a try: