Template Matching Based Probabilistic Optical Character Recognition for Urdu Nastaliq Script

Qaiser Abbas, Department of Computer Science & IT, University of Sargodha, Sargodha, 40100, Pakistan
Keywords: OCR, Template matching, N-Grams, Segmentation, Stochastic, Nastaliq, Urdu

Abstract

This paper presents a technique for optical recognition of Urdu characters that combines template matching with a probabilistic N-Gram language model. The dataset contains both printed and typed text. The model performs three types of segmentation: line segmentation using horizontal projection, ligature segmentation using connected component labeling, and character segmentation using corner and pointer techniques. A separate stochastic lexicon, containing the probability values of the grams, is built from a collected corpus. By using template matching together with the N-Gram language model, our approach predicts complete segmented words with promising results, particularly in the case of bigrams. It outperforms three out of four existing models with an accuracy rate of 97.33%. The results on our test dataset are encouraging, but they also indicate directions for further improvement of the model.
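
As an illustration of the line segmentation step named in the abstract, the following is a minimal sketch of horizontal projection on a binarized page. It is not the paper's implementation: the function name `segment_lines`, the list-of-rows image representation, and the rule that any row with zero ink pixels separates two lines are all assumptions made for this example.

```python
def segment_lines(image):
    """Return (start_row, end_row) pairs, end exclusive, one per text line.

    `image` is a list of rows; each row is a list of 0 (background) or
    1 (ink) values. This representation is an assumption of the sketch.
    """
    # Horizontal projection profile: number of ink pixels in each row.
    profile = [sum(row) for row in image]

    lines = []
    start = None  # first row of the line currently being scanned, if any
    for y, ink in enumerate(profile):
        if ink > 0 and start is None:
            start = y                    # entering a text line
        elif ink == 0 and start is not None:
            lines.append((start, y))     # blank row closes the line
            start = None
    if start is not None:
        lines.append((start, len(profile)))  # line runs to the last row
    return lines


# Tiny hypothetical page: two text lines separated by one blank row.
page = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]
print(segment_lines(page))  # [(1, 3), (4, 5)]
```

In practice, scanned Nastaliq text may need a small threshold instead of a strict zero test, since diacritics and noise can leave stray ink in the gaps between lines.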

Published
2021-06-21
How to Cite
Qaiser Abbas. (2021). Template Matching Based Probabilistic Optical Character Recognition for Urdu Nastaliq Script. Lahore Garrison University Research Journal of Computer Science and Information Technology, 5(2), 41-47. https://doi.org/10.54692/lgurjcsit.2021.0502207