Build a Synthetic MRZ Images Dataset

MRZ stands for “machine-readable zone”. It usually sits at the bottom of a document's identity page and encodes information such as the document type, name, nationality, date of birth, sex and expiration date so that machines can read it.

A dataset of MRZ images is needed to train an MRZ OCR engine or to evaluate its performance. However, because identity documents contain sensitive personal information, it is not practical to collect a large volume of real images.

In this article, we are going to build a synthetic MRZ image dataset with the following steps:

  1. Collect some sample passport images of different countries from the internet.
  2. Detect the MRZ string in each image and erase it using inpainting. Because we only need to recognize the MRZ string, it is not necessary to erase other text.
  3. Randomly generate MRZ strings with different names, sexes, countries and dates.
  4. Draw the MRZ string on the MRZ-removed passport images.

We will talk about the details in the following parts.

Inpainting

  1. Create text masks.

    After detection, we can use binarization to segment the MRZ code to create text masks.

    Original image example:

    USA Passport

    Text mask example:

    Text mask

  2. Inpainting. We use lama-inpaint to create MRZ-removed images using the original images and the mask images.

    MRZ-removed image example:

    Text-removed USA Passport
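
As a rough illustration of the mask-creation step, the sketch below thresholds the pixels inside an MRZ bounding box of a grayscale image stored as nested lists. It is a library-free toy rather than the code used for the dataset: the function name `make_text_mask`, the box format and the fixed threshold of 128 are all assumptions, and in practice an image library with an adaptive or Otsu threshold would likely be a better fit. White (255) marks the pixels to be inpainted.

```python
def make_text_mask(gray, box, threshold=128):
    """Return a mask with 255 where dark text pixels lie inside box, else 0.

    gray: 2D list of grayscale values (0-255), indexed as gray[y][x].
    box: (left, top, right, bottom) with right/bottom exclusive
         (a hypothetical format chosen for this sketch).
    """
    height, width = len(gray), len(gray[0])
    left, top, right, bottom = box
    mask = [[0] * width for _ in range(height)]
    # Only pixels inside the detected MRZ box can become part of the mask.
    for y in range(max(top, 0), min(bottom, height)):
        for x in range(max(left, 0), min(right, width)):
            if gray[y][x] < threshold:  # dark pixel, assumed to be text
                mask[y][x] = 255
    return mask


# Tiny 4x5 "image": light background with a few dark pixels.
gray = [
    [250, 250,  20, 250, 250],
    [250,  30, 240,  40, 250],
    [250,  10,  15, 245, 250],
    [ 20, 250, 250, 250, 250],
]
mask = make_text_mask(gray, box=(1, 1, 4, 3))
for row in mask:
    print(row)
```

Dark pixels outside the box, such as the ones in the first and last rows, stay out of the mask, so only the MRZ text is erased.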

Random MRZ Generation

For MRZ generation, we can use the mrz Python library to create MRZ strings that conform to the MRZ specifications.

  • Generate random names.

     import names
     import random
     def random_surname():
         return names.get_last_name()
    
     def random_given_names():
         return names.get_first_name()
    
  • Generate a random sex.

     sex = random.choice(['M', 'F'])
    
  • Generate a random country from a predefined dict.

     COUNTRIES = {
         "BLR":"Belarus-passport-mini.jpg",
         "BEL":"Belgium-passport-mini.jpg",
         "BGR":"Bulgaria-passport-mini.jpg",
         "CAN":"Canada-passport-mini.jpg",
         "CHL":"Chile-passport-mini.jpg",
         "CHN":"China-passport-mini.jpg",
         "DOM":"Dominicana-passport-mini.jpg",
         "EST":"Estonia-passport-mini.jpg",
         "D":"Germany-passport-mini.jpg",
         "IDN":"Indonesia-passport-mini.jpg",
         "IRL":"Ireland-passport-mini.jpg",
         "ITA":"Italy-passport-mini.jpg",
         "JPN":"Japanese-passport.jpg",
         "KAZ":"Kazakhstan-passport-mini.jpg",
         "MEX":"Mexico-passport-mini.jpg",
         "MDA":"Moldova-passport-mini.jpg",
         "NLD":"Netherlands-passport-mini.jpg",
         "POL":"Poland-passport-mini.jpg",
         "ROU":"Romania-passport-mini.jpg",
         "SVK":"Slovakia-passport-mini.jpg",
         "ESP":"Spain-passport-mini.jpg",
         "GBR":"United-kingdom-of-great-britain-passport-mini.jpg",
         "URY":"Uruguay-passport-mini.jpg",
         "UZB":"Uzbekistan-passport-mini.jpg",
         "USA":"USA-Passport.jpg"
     }
       
     nationality = random.choice(list(COUNTRIES.keys()))
    
  • Generate a random document number.

     def random_string(length=10, allowed_chars='ABCDEFGHIJKLMNOPQRSTUVWXYZ'):
         return ''.join(random.choice(allowed_chars) for i in range(length))
     document_number = random_string(9, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789')
    
  • Generate a random date.

     import datetime
     def random_date(start_year=1900, end_year=datetime.datetime.now().year):
         year = random.randint(start_year, end_year)
         month = random.randint(1, 12)
    
         if month in [1, 3, 5, 7, 8, 10, 12]:
             day = random.randint(1, 31)
         elif month in [4, 6, 9, 11]:
             day = random.randint(1, 30)
         else:  # February
             if (year % 4 == 0 and year % 100 != 0) or (year % 400 == 0):  # leap year
                 day = random.randint(1, 29)
             else:
                 day = random.randint(1, 28)
    
         return datetime.date(year, month, day)
    
  • Generate the MRZ code based on the random values using the functions above.

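     import mrz.generator.td1
     import mrz.generator.td2
     import mrz.generator.td3
     import mrz.generator.mrva
     import mrz.generator.mrvb

     MRZ_TYPES = ['TD1','TD2','TD3','MRVA','MRVB']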
     def generate_MRZ(doc_type,country,surname,given_names,document_number,nationality,birth_date,sex,expiry_date,optional1,optional2):
         code = ""
         if doc_type == "TD1":
             code = mrz.generator.td1.TD1CodeGenerator("I", country, document_number, birth_date, sex, expiry_date,nationality, surname, given_names, optional1, optional2)
         elif doc_type == "TD2":
             code = mrz.generator.td2.TD2CodeGenerator("I", country, surname, given_names, document_number, nationality, birth_date, sex, expiry_date, optional1)
         elif doc_type == "TD3":
             code = mrz.generator.td3.TD3CodeGenerator("P", country, surname, given_names, document_number, nationality, birth_date, sex, expiry_date, optional1)
         elif doc_type == "MRVA":
             code = mrz.generator.mrva.MRVACodeGenerator("V", country, surname, given_names, document_number, nationality, birth_date, sex, expiry_date, optional1)
         elif doc_type == "MRVB":
             code = mrz.generator.mrvb.MRVBCodeGenerator("V", country, surname, given_names, document_number, nationality, birth_date, sex, expiry_date, optional1)
         return code
           
     def random_generate(doc_type="",nationality="GBR"):
         surname = random_surname()
         given_names = random_given_names()
         # Fall back to random choices when no nationality or document type is given.
         if not nationality:
             nationality = random.choice(list(COUNTRIES.keys()))
         sex = random.choice(['M', 'F'])
         if not doc_type:
             doc_type = random.choice(MRZ_TYPES)
         document_number = random_string(9, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789')
         birth_date = random_date().strftime('%y%m%d')
         this_year = datetime.datetime.now().year
         expiry_date = random_date(start_year=this_year, end_year=this_year + 10).strftime('%y%m%d')
         # The nationality code is also used as the issuing country.
         code = generate_MRZ(doc_type,nationality,surname,given_names,document_number,nationality,birth_date,sex,expiry_date,"","")
         return code
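
Part of what conforming to the MRZ specifications means is that several fields are followed by a check digit, which the generator classes are expected to compute for us. For readers curious about that step, here is a minimal sketch of the check-digit rule from ICAO Doc 9303: weights 7, 3, 1 repeat over the field, digits keep their value, A to Z map to 10 to 35, and the filler `<` counts as 0. The function names are illustrative and are not part of the mrz library.

```python
def mrz_char_value(ch):
    """Map a single MRZ character to its numeric value for the check digit."""
    if ch in "0123456789":
        return int(ch)
    if "A" <= ch <= "Z":
        return ord(ch) - ord("A") + 10
    if ch == "<":
        return 0
    raise ValueError(f"invalid MRZ character: {ch!r}")


def check_digit(field):
    """Compute the check digit of an MRZ field with repeating weights 7, 3, 1."""
    weights = (7, 3, 1)
    total = sum(mrz_char_value(ch) * weights[i % 3] for i, ch in enumerate(field))
    return total % 10


# Fields from the specimen passport in ICAO Doc 9303.
print(check_digit("L898902C3"))  # document number, prints 6
print(check_digit("740812"))     # date of birth, prints 2
print(check_digit("120415"))     # date of expiry, prints 9
```

A helper like this can also be used to spot-check the generated strings before drawing them on the images.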
    

Drawing the Text

We can draw the text with Pillow using OCR-B, the font used for MRZ text. The text is placed according to the position and width of the original MRZ string.

import json
import os
from PIL import Image, ImageDraw, ImageFont

def mrz_filled(code,nationality):
    code = str(code)
    # A JSON file holding the position and size of the MRZ string
    with open("images/1.itp","r",encoding="utf-8") as f:
        project = json.load(f)
    img_name = COUNTRIES[nationality]
    images = project["images"]
    image = images[img_name]
    boxes = image["boxes"]
    box1 = boxes[0]
    box2 = boxes[1]
    width = box1["geometry"]["width"]
    font_size = int(width/1828*56)
    img = Image.open(os.path.join("images",img_name+"-text-removed.jpg"))
    draw = ImageDraw.Draw(img)
    font = ImageFont.truetype("OCRB-Regular.ttf", font_size)
    draw.text((box1["geometry"]["X"], box1["geometry"]["Y"]), code.split("\n")[0], fill ="black", font = font, align ="right")  
    draw.text((box2["geometry"]["X"], box2["geometry"]["Y"]), code.split("\n")[1], fill ="black", font = font, align ="right")  
    return img
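
Judging from the keys accessed in `mrz_filled`, the `.itp` file appears to map each image name to a list of boxes, each with an `X`, a `Y` and a `width`. The sketch below parses a made-up fragment with that shape and derives the draw positions and font size in the same way. The sample values and the helper name `mrz_layout` are hypothetical, and the real file may contain additional fields. The ratio 56/1828 is taken from the code above and presumably reflects a reference image in which a 1828-pixel-wide MRZ line was drawn at font size 56.

```python
import json

# Hypothetical fragment shaped like the keys that mrz_filled reads.
SAMPLE_PROJECT = """
{
  "images": {
    "USA-Passport.jpg": {
      "boxes": [
        {"geometry": {"X": 40, "Y": 700, "width": 914}},
        {"geometry": {"X": 40, "Y": 745, "width": 914}}
      ]
    }
  }
}
"""


def mrz_layout(project, img_name, ref_width=1828, ref_font_size=56):
    """Return the font size and the top-left position of each MRZ line."""
    boxes = project["images"][img_name]["boxes"]
    # Scale the font in proportion to the width of the first MRZ line.
    width = boxes[0]["geometry"]["width"]
    font_size = int(width / ref_width * ref_font_size)
    positions = [(b["geometry"]["X"], b["geometry"]["Y"]) for b in boxes]
    return font_size, positions


project = json.loads(SAMPLE_PROJECT)
font_size, positions = mrz_layout(project, "USA-Passport.jpg")
print(font_size, positions)
```

Separating the layout computation from the drawing makes it easier to check the geometry for each passport template without opening any image files.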

You can find the synthetic image dataset we created in this repo: https://github.com/tony-xlh/MRZ-dataset/tree/gh-pages/benchmark/dataset/Passports

Source Code

Get the source code of the synthesizer to create your own dataset:

https://github.com/tony-xlh/SynthMRZ

You can run an MRZ benchmark on your dataset with this benchmark tool.