Abstract:
Natural Language Processing (NLP) is an area of research and application that explores how computers can be used to manipulate natural language. One important research area of NLP that has attracted considerable attention is Optical Character Recognition (OCR), a method for converting scanned handwritten or printed documents into editable text.
This research attempts handwritten character recognition for the 202 main characters of the Ge’ez language using a Support Vector Machine (SVM). The training and testing data sets are collected from Ge’ez vellum books. Different techniques are applied at each phase, from digitization through recognition, and MATLAB image processing is used for the experiments. Iterative thresholding for binarizing the digitized image, bi-level filtering for noise removal, nearest-neighbor interpolation for normalization, morphological analysis for thinning, and horizontal profiles for feature extraction are found to work well for this problem. Using a stage-by-stage segmentation algorithm, segmentation rates of 90.5% and 77.4% are attained for noise-free and noisy document images, respectively.
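As an illustration of two of the preprocessing steps named above, the following is a minimal Python sketch of iterative thresholding and a horizontal profile. It is not the authors' MATLAB implementation. The function names, the mean-of-class-means update rule, the tolerance, and the assumption of dark ink on a light background are all illustrative choices that the abstract does not specify.

```python
def iterative_threshold(pixels, tol=0.5, max_iter=100):
    """Estimate a global threshold by repeatedly averaging the class means.

    Assumption: this is one common form of iterative thresholding; the
    exact variant used in the research is not stated in the abstract.
    """
    t = sum(pixels) / len(pixels)  # start from the global mean intensity
    for _ in range(max_iter):
        low = [p for p in pixels if p <= t]
        high = [p for p in pixels if p > t]
        if not low or not high:
            break  # all pixels fall in one class; keep the current threshold
        new_t = (sum(low) / len(low) + sum(high) / len(high)) / 2
        if abs(new_t - t) < tol:
            return new_t
        t = new_t
    return t


def binarize(image, threshold):
    """Map dark (ink) pixels to 1 and light (background) pixels to 0."""
    return [[1 if p <= threshold else 0 for p in row] for row in image]


def horizontal_profile(binary):
    """Count ink pixels in each row of a binary image."""
    return [sum(row) for row in binary]


# Hypothetical 4x4 grayscale patch: a few dark strokes on light vellum.
image = [
    [250, 245, 240, 250],
    [240,  30,  20, 245],
    [250,  25, 250, 240],
    [245, 250, 248, 250],
]
flat = [p for row in image for p in row]
t = iterative_threshold(flat)
binary = binarize(image, t)
profile = horizontal_profile(binary)
print(profile)
```

In this toy patch the ink lies in the two middle rows, so the profile is nonzero only there. Row-wise counts of this kind can serve both as a cue for separating text lines and as a simple feature vector for a classifier.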
An SVM is used for classification. The SVM classifier is trained on 12 document pages (7 noise-free and 5 noisy) taken from real-life Ge’ez documents. The classifier is then tested on 6 document pages (4 noise-free and 2 noisy) that are not included in the training data set. Average recognition rates of 63.4% and 51.7% are registered for noise-free and noisy document images, respectively. The performance of the system is strongly affected by the similarity in shape among Ge’ez characters and by the effectiveness of the preprocessing techniques. Shape-invariant feature extraction techniques and more advanced noise detection and removal algorithms should be investigated in the future.