How to Build a Synthetic MRZ Dataset for OCR Training and Testing
MRZ stands for “machine-readable zone”. It is usually at the bottom of an identity page for machines to read its info like document type, name, nationality, date of birth, sex and expiration date, etc.
A dataset of MRZ images is needed to train an OCR engine of MRZ or evaluate the performance of an OCR engine. However, because identity documents contain sensitive personal information, it is not possible to collect a large volume of those images.
What you’ll build: A synthetic MRZ image dataset using Python — combining inpainting, random MRZ string generation, and OCR-B font rendering on real passport templates from 25 countries.
Key Takeaways
- Synthetic MRZ datasets solve the privacy problem of collecting real passport images for OCR training and testing.
- Python’s
mrzlibrary generates spec-compliant MRZ strings for TD1, TD2, TD3, MRVA, and MRVB document types. - LaMa inpainting cleanly removes existing MRZ text from passport templates while preserving the background.
- The resulting dataset supports benchmarking any MRZ OCR engine, including Dynamsoft Label Recognizer.
Common Developer Questions
- How do I generate synthetic MRZ images for OCR training in Python?
- What is the best way to create mock passport images for MRZ testing?
- How can I benchmark MRZ OCR accuracy with a synthetic dataset?
Prerequisites
- Python 3.7 or later
- Libraries:
mrz,names,Pillow,opencv-python - Sample passport images from different countries
- The OCR-B font file (
OCRB-Regular.ttf) - Get a 30-day free trial license to test the dataset with Dynamsoft Label Recognizer
In this article, we are going to build a synthetic MRZ image dataset with the following steps:
- Collect some sample passport images of different countries from the internet.
- Detect the MRZ string and erase them from the image using inpainting. Because we only need to recognize the MRZ string, it is not neccessary to erase other text.
- Randomly generate MRZ strings with different names, sexes, countries and dates.
- Draw the MRZ string on the MRZ-removed passport images.
We will talk about the details in the following parts.
Step 1: Remove Existing MRZ Text with Inpainting
-
Create text masks.
After detection, we can use binarization to segment the MRZ code to create text masks.
Original image example:

Text mask example:

-
Inpainting. We use lama-inpaint to create MRZ-removed images using the original images and the mask images.
MRZ-removed image example:

Step 2: Generate Random MRZ Strings with Python
For MRZ generation, we can use the mrz python library to create MRZ strings conforming to MRZ specifications.
-
Generate random names.
import names import random def random_surname(): return names.get_last_name() def random_given_names(): return names.get_first_name() -
Generate a random sex.
sex = random.choice(['M', 'F']) -
Generate a random country from a predefined dict.
COUNTRIES = { "BLR":"Belarus-passport-mini.jpg", "BEL":"Belgium-passport-mini.jpg", "BGR":"Bulgaria-passport-mini.jpg", "CAN":"Canada-passport-mini.jpg", "CHL":"Chile-passport-mini.jpg", "CHN":"China-passport-mini.jpg", "DOM":"Dominicana-passport-mini.jpg", "EST":"Estonia-passport-mini.jpg", "D":"Germany-passport-mini.jpg", "IDN":"Indonesia-passport-mini.jpg", "IRL":"Ireland-passport-mini.jpg", "ITA":"Italy-passport-mini.jpg", "JPN":"Japanese-passport.jpg", "KAZ":"Kazakhstan-passport-mini.jpg", "MEX":"Mexico-passport-mini.jpg", "MDA":"Moldova-passport-mini.jpg", "NLD":"Netherlands-passport-mini.jpg", "POL":"Poland-passport-mini.jpg", "ROU":"Romania-passport-mini.jpg", "SVK":"Slovakia-passport-mini.jpg", "ESP":"Spain-passport-mini.jpg", "GBR":"United-kingdom-of-great-britain-passport-mini.jpg", "URY":"Uruguay-passport-mini.jpg", "UZB":"Uzbekistan-passport-mini.jpg", "USA":"USA-Passport.jpg" } nationality = random.choice(list(COUNTRIES.keys())) -
Generate a random document number.
def random_string(length=10, allowed_chars='ABCDEFGHIJKLMNOPQRSTUVWXYZ'): return ''.join(random.choice(allowed_chars) for i in range(length)) document_number = random_string(9, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789') -
Generate a random date.
def random_date(start_year=1900, end_year=datetime.datetime.now().year): year = random.randint(start_year, end_year) month = random.randint(1, 12) if month in [1, 3, 5, 7, 8, 10, 12]: day = random.randint(1, 31) elif month in [4, 6, 9, 11]: day = random.randint(1, 30) else: # February if (year % 4 == 0 and year % 100 != 0) or (year % 400 == 0): # leap year day = random.randint(1, 29) else: day = random.randint(1, 28) return datetime.date(year, month, day) -
Generate the MRZ code based on the random values using the functions above.
MRZ_TYPES = ['TD1','TD2','TD3','MRVA','MRVB'] def generate_MRZ(doc_type,country,surname,given_names,document_number,nationality,birth_date,sex,expiry_date,optional1,optional2): code = "" if doc_type == "TD1": code = mrz.generator.td1.TD1CodeGenerator("I", country, document_number, birth_date, sex, expiry_date,nationality, surname, given_names, optional1, optional2) elif doc_type == "TD2": code = mrz.generator.td2.TD2CodeGenerator("I", country, surname, given_names, document_number, nationality, birth_date, sex, expiry_date, optional1) elif doc_type == "TD3": code = mrz.generator.td3.TD3CodeGenerator("P", country, surname, given_names, document_number, nationality, birth_date, sex, expiry_date, optional1) elif doc_type == "MRVA": code = mrz.generator.mrva.MRVACodeGenerator("V", country, surname, given_names, document_number, nationality, birth_date, sex, expiry_date, optional1) elif doc_type == "MRVB": code = mrz.generator.mrvb.MRVBCodeGenerator("V", country, surname, given_names, document_number, nationality, birth_date, sex, expiry_date, optional1) return code def random_generate(doc_type="",nationality="GBR"): surname = random_surname() given_names = random_given_names() if nationality == "" or nationality == None: nationality = random.choice(list(COUNTRIES.keys())) sex = random.choice(['M', 'F']) if doc_type == "" or doc_type == None: doc_type = random.choice(MRZ_TYPES) document_number = random_string(9, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789') birth_date = random_date().strftime('%y%m%d') expiry_date = random_date(start_year=datetime.datetime.now( ).year, end_year=datetime.datetime.now().year + 10).strftime('%y%m%d') code = generate_MRZ(doc_type,nationality,surname,given_names,document_number,nationality,birth_date,sex,expiry_date,"","") return code
Step 3: Render MRZ Text onto Passport Images
We can draw the text with Pillow using the OCR-B font which is used by MRZ. The text is drawn based on the original position and width of the MRZ string.
def mrz_filled(code,nationality):
code = str(code)
f = open("images/1.itp","r",encoding="utf-8") #A JSON file holding the position and size of the MRZ string
content = f.read()
f.close()
project = json.loads(content)
img_name = COUNTRIES[nationality]
images = project["images"]
image = images[img_name]
boxes = image["boxes"]
box1 = boxes[0]
box2 = boxes[1]
width = box1["geometry"]["width"]
font_size = int(width/1828*56)
img = Image.open(os.path.join("images",img_name+"-text-removed.jpg"))
draw = ImageDraw.Draw(img)
font = ImageFont.truetype("OCRB-Regular.ttf", font_size)
draw.text((box1["geometry"]["X"], box1["geometry"]["Y"]), code.split("\n")[0], fill ="black", font = font, align ="right")
draw.text((box2["geometry"]["X"], box2["geometry"]["Y"]), code.split("\n")[1], fill ="black", font = font, align ="right")
return img
You can find the synthetic image dataset we created in this repo: https://github.com/tony-xlh/MRZ-dataset/tree/gh-pages/benchmark/dataset/Passports
Common Issues and Edge Cases
- Font size mismatch: If the rendered MRZ text doesn’t align with the passport template, verify that the font size calculation uses the correct reference width (1828 px in our case). Different source images may need recalibration.
- Inpainting artifacts near borders: LaMa inpainting can leave visible seams when the MRZ zone is close to the document edge. Expanding the mask by a few pixels before inpainting typically eliminates these artifacts.
- Invalid MRZ check digits: The
mrzlibrary handles check digit computation automatically. If you manually modify MRZ strings after generation, recalculate the check digits or the output will fail validation by any standards-compliant reader.
Source Code
Get the source code of the synthesizer to create your own dataset:
https://github.com/tony-xlh/SynthMRZ
You can run an MRZ benchmark on your dataset with this benchmark tool.