Build a Synthetic MRZ Image Dataset
MRZ stands for "machine-readable zone". It usually sits at the bottom of the identity page and encodes information such as document type, name, nationality, date of birth, sex and expiration date in a form that machines can read.
A dataset of MRZ images is needed to train an MRZ OCR engine or to evaluate the performance of one. However, because identity documents contain sensitive personal information, it is difficult to collect a large volume of real images.
In this article, we are going to build a synthetic MRZ image dataset with the following steps:
- Collect some sample passport images of different countries from the internet.
- Detect the MRZ string and erase it from the image using inpainting. Because we only need to recognize the MRZ string, it is not necessary to erase other text.
- Randomly generate MRZ strings with different names, sexes, countries and dates.
- Draw the MRZ string on the MRZ-removed passport images.
The following sections cover each step in detail.
Inpainting
- Create text masks.
After detection, we can use binarization to segment the MRZ characters and create text masks.
Original image example:
Text mask example:
- Inpainting. We use lama-inpaint to create MRZ-removed images from the original images and the mask images.
MRZ-removed image example:
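The mask-creation step above can be sketched in a few lines. This is only a minimal illustration, not the exact method used for the dataset: the article does not say which binarization method or threshold was used, so the fixed threshold of 128 and the helper name `make_text_mask` are assumptions. It also assumes the inpainting tool treats white mask pixels as the region to fill. The image is represented as a plain nested list of grayscale values so that the sketch runs without any image library.

```python
def make_text_mask(gray, threshold=128):
    """Return a mask with 255 where a pixel is darker than `threshold`, else 0.

    `gray` is a list of rows of grayscale values (0 = black, 255 = white).
    Dark pixels are assumed to be MRZ text, which is printed in black.
    The threshold of 128 is an assumption for this sketch.
    """
    return [[255 if px < threshold else 0 for px in row] for row in gray]


# A tiny 3x5 "image": dark strokes (around 20) on a light background (around 230).
gray = [
    [230, 20, 230, 25, 230],
    [228, 22, 231, 18, 229],
    [230, 230, 230, 230, 230],
]
mask = make_text_mask(gray)
print(mask[0])  # [0, 255, 0, 255, 0]
```

In practice the mask would be computed only inside the detected MRZ region, so that other dark text on the page is left untouched.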
Random MRZ Generation
For MRZ generation, we can use the mrz Python library to create MRZ strings conforming to MRZ specifications.
- Generate random names.

import names
import random

def random_surname():
    return names.get_last_name()

def random_given_names():
    return names.get_first_name()

- Generate a random sex.
sex = random.choice(['M', 'F'])
- Generate a random country from a predefined dict.

# Maps each country code to the sample passport image for that country.
COUNTRIES = {
    "BLR": "Belarus-passport-mini.jpg",
    "BEL": "Belgium-passport-mini.jpg",
    "BGR": "Bulgaria-passport-mini.jpg",
    "CAN": "Canada-passport-mini.jpg",
    "CHL": "Chile-passport-mini.jpg",
    "CHN": "China-passport-mini.jpg",
    "DOM": "Dominicana-passport-mini.jpg",
    "EST": "Estonia-passport-mini.jpg",
    "D": "Germany-passport-mini.jpg",
    "IDN": "Indonesia-passport-mini.jpg",
    "IRL": "Ireland-passport-mini.jpg",
    "ITA": "Italy-passport-mini.jpg",
    "JPN": "Japanese-passport.jpg",
    "KAZ": "Kazakhstan-passport-mini.jpg",
    "MEX": "Mexico-passport-mini.jpg",
    "MDA": "Moldova-passport-mini.jpg",
    "NLD": "Netherlands-passport-mini.jpg",
    "POL": "Poland-passport-mini.jpg",
    "ROU": "Romania-passport-mini.jpg",
    "SVK": "Slovakia-passport-mini.jpg",
    "ESP": "Spain-passport-mini.jpg",
    "GBR": "United-kingdom-of-great-britain-passport-mini.jpg",
    "URY": "Uruguay-passport-mini.jpg",
    "UZB": "Uzbekistan-passport-mini.jpg",
    "USA": "USA-Passport.jpg"
}

nationality = random.choice(list(COUNTRIES.keys()))

- Generate a random document number.

def random_string(length=10, allowed_chars='ABCDEFGHIJKLMNOPQRSTUVWXYZ'):
    return ''.join(random.choice(allowed_chars) for _ in range(length))

document_number = random_string(9, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789')

- Generate a random date.

import datetime

def random_date(start_year=1900, end_year=datetime.datetime.now().year):
    year = random.randint(start_year, end_year)
    month = random.randint(1, 12)
    if month in [1, 3, 5, 7, 8, 10, 12]:
        day = random.randint(1, 31)
    elif month in [4, 6, 9, 11]:
        day = random.randint(1, 30)
    else:  # February
        if (year % 4 == 0 and year % 100 != 0) or (year % 400 == 0):  # leap year
            day = random.randint(1, 29)
        else:
            day = random.randint(1, 28)
    return datetime.date(year, month, day)

- Generate the MRZ code based on the random values using the functions above.

import mrz.generator.td1
import mrz.generator.td2
import mrz.generator.td3
import mrz.generator.mrva
import mrz.generator.mrvb

MRZ_TYPES = ['TD1', 'TD2', 'TD3', 'MRVA', 'MRVB']

def generate_MRZ(doc_type, country, surname, given_names, document_number,
                 nationality, birth_date, sex, expiry_date, optional1, optional2):
    code = ""
    if doc_type == "TD1":
        # TD1 takes its arguments in a different order and has two optional fields.
        code = mrz.generator.td1.TD1CodeGenerator(
            "I", country, document_number, birth_date, sex, expiry_date,
            nationality, surname, given_names, optional1, optional2)
    elif doc_type == "TD2":
        code = mrz.generator.td2.TD2CodeGenerator(
            "I", country, surname, given_names, document_number,
            nationality, birth_date, sex, expiry_date, optional1)
    elif doc_type == "TD3":
        code = mrz.generator.td3.TD3CodeGenerator(
            "P", country, surname, given_names, document_number,
            nationality, birth_date, sex, expiry_date, optional1)
    elif doc_type == "MRVA":
        code = mrz.generator.mrva.MRVACodeGenerator(
            "V", country, surname, given_names, document_number,
            nationality, birth_date, sex, expiry_date, optional1)
    elif doc_type == "MRVB":
        code = mrz.generator.mrvb.MRVBCodeGenerator(
            "V", country, surname, given_names, document_number,
            nationality, birth_date, sex, expiry_date, optional1)
    return code

def random_generate(doc_type="", nationality="GBR"):
    surname = random_surname()
    given_names = random_given_names()
    # Pick a random country or document type when none is given.
    if not nationality:
        nationality = random.choice(list(COUNTRIES.keys()))
    sex = random.choice(['M', 'F'])
    if not doc_type:
        doc_type = random.choice(MRZ_TYPES)
    document_number = random_string(9, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789')
    birth_date = random_date().strftime('%y%m%d')
    this_year = datetime.datetime.now().year
    expiry_date = random_date(start_year=this_year,
                              end_year=this_year + 10).strftime('%y%m%d')
    # The issuing country and the nationality are set to the same code.
    code = generate_MRZ(doc_type, nationality, surname, given_names, document_number,
                        nationality, birth_date, sex, expiry_date, "", "")
    return code

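Part of what "conforming to MRZ specifications" means is that fields such as the document number and the dates are followed by a check digit. The generators in the mrz library fill these in, so the sketch below is not needed to build the dataset. It only illustrates the check digit rule from ICAO Doc 9303 (weights 7, 3, 1 repeated, letters valued 10 to 35, the filler `<` valued 0, sum taken modulo 10). The function name `mrz_check_digit` is made up for this example and is not part of the mrz library.

```python
def mrz_check_digit(field: str) -> int:
    """Compute the ICAO Doc 9303 check digit for one MRZ field."""
    weights = (7, 3, 1)
    total = 0
    for i, ch in enumerate(field):
        if "0" <= ch <= "9":
            value = int(ch)
        elif "A" <= ch <= "Z":
            value = ord(ch) - ord("A") + 10  # A=10 ... Z=35
        elif ch == "<":
            value = 0  # filler character
        else:
            raise ValueError(f"invalid MRZ character: {ch!r}")
        total += value * weights[i % 3]
    return total % 10


# Fields taken from the specimen passport in ICAO Doc 9303.
print(mrz_check_digit("L898902C3"))  # 6 (document number)
print(mrz_check_digit("740812"))     # 2 (date of birth)
print(mrz_check_digit("120415"))     # 9 (date of expiry)
```

A helper like this can be handy when evaluating an OCR engine, because a recognized field whose check digit does not match must contain at least one misread character.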
Drawing the Text
We can draw the text with Pillow using the OCR-B font, which is the typeface used for MRZs. The text is drawn at the original position of the MRZ string, with the font size scaled to its original width.
import json
import os
from PIL import Image, ImageDraw, ImageFont

def mrz_filled(code, nationality):
    code = str(code)
    # A JSON file holding the position and size of the MRZ string
    with open("images/1.itp", "r", encoding="utf-8") as f:
        project = json.loads(f.read())
    img_name = COUNTRIES[nationality]
    image = project["images"][img_name]
    boxes = image["boxes"]
    box1 = boxes[0]  # first MRZ line
    box2 = boxes[1]  # second MRZ line
    width = box1["geometry"]["width"]
    # Scale the font relative to a size of 56 for a 1828-pixel-wide line.
    font_size = int(width / 1828 * 56)
    img = Image.open(os.path.join("images", img_name + "-text-removed.jpg"))
    draw = ImageDraw.Draw(img)
    font = ImageFont.truetype("OCRB-Regular.ttf", font_size)
    lines = code.split("\n")
    draw.text((box1["geometry"]["X"], box1["geometry"]["Y"]), lines[0],
              fill="black", font=font, align="right")
    draw.text((box2["geometry"]["X"], box2["geometry"]["Y"]), lines[1],
              fill="black", font=font, align="right")
    return img

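The drawing function above places exactly two lines of text, which matches the two-line formats such as TD3 used on passports. A TD1 code has three lines, so its last line would not be drawn. As a hedged sketch, the helpers below check the shape of a generated code before drawing and reproduce the font scaling on its own. The names `MRZ_LAYOUTS`, `split_mrz_lines` and `scaled_font_size` are hypothetical and do not come from the synthesizer's source. The line counts and lengths are those defined in ICAO Doc 9303.

```python
# (number of lines, characters per line) for each MRZ format in ICAO Doc 9303.
MRZ_LAYOUTS = {
    "TD1": (3, 30),
    "TD2": (2, 36),
    "TD3": (2, 44),
    "MRVA": (2, 44),
    "MRVB": (2, 36),
}


def split_mrz_lines(code, doc_type="TD3"):
    """Split an MRZ code into lines and verify the layout for `doc_type`."""
    lines = str(code).split("\n")
    expected_count, expected_len = MRZ_LAYOUTS[doc_type]
    if len(lines) != expected_count:
        raise ValueError(f"{doc_type} needs {expected_count} lines, got {len(lines)}")
    for line in lines:
        if len(line) != expected_len:
            raise ValueError(f"{doc_type} lines need {expected_len} characters, got {len(line)}")
    return lines


def scaled_font_size(box_width, ref_width=1828, ref_size=56):
    """Scale the font linearly with the width of the MRZ box (same ratio as above)."""
    return int(box_width / ref_width * ref_size)


# The specimen TD3 code from ICAO Doc 9303; the first line is padded with '<'.
line1 = "P<UTOERIKSSON<<ANNA<MARIA".ljust(44, "<")
line2 = "L898902C36UTO7408122F1204159ZE184226B<<<<<10"
lines = split_mrz_lines(line1 + "\n" + line2, "TD3")
print(len(lines), scaled_font_size(914))  # 2 28
```

Calling `split_mrz_lines` before drawing would surface a mismatched format as an error instead of producing an image with a missing or misplaced line.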
You can find the synthetic image dataset we created in this repo: https://github.com/tony-xlh/MRZ-dataset/tree/gh-pages/benchmark/dataset/Passports
Source Code
Get the source code of the synthesizer to create your own dataset:
https://github.com/tony-xlh/SynthMRZ
You can run an MRZ benchmark on your dataset with this benchmark tool.