Context-Dependent Spelling Error Detection and Correction For Ge’ez Language

Tesfie, Aleka

URI: http://hdl.handle.net/123456789/4953

Date: 2022-02-04

Abstract:

A spelling checker is a natural language processing (NLP) application that detects incorrectly spelled words in a given text and suggests possible corrections. There are two classes of spelling errors: non-word and real-word errors. Non-word errors are mistyped words that have no meaning and cannot be found in the dictionary of the language. Real-word errors are valid words that have a meaning but are incorrect in their context. Both types of error are common in written Ge’ez materials. In this study, a context-dependent spelling error detector and corrector for the Ge’ez language is designed and developed using the design science research methodology. The study focuses on errors caused by the incorrect use of homophone Ge’ez letters, which can give a word a completely different meaning in a given sentence. The proposed context-dependent spell checker architecture contains three main components: text preprocessing, error detection, and error correction. An n-gram technique is applied to detect and correct context-dependent spelling errors. Suggestions are selected from the possible ways of writing each word that is flagged as contextually incorrect. The experiment was conducted using 16,918 sentences for training and 650 sentences for testing. The test data contains 6,463 words, of which 461 were contextually incorrect words (artificially injected). Bigram-based and trigram-based experiments were carried out independently, and the better-performing one is recommended. The bigram-based spell checker scored 99.79% lexical precision, 66.52% error precision, 96.23% lexical recall, 97.39% error recall, and 96.31% accuracy. The trigram-based spell checker scored 99.94% lexical precision, 67.65% error precision, 96.35% lexical recall, 99.34% error recall, and 96.56% accuracy. The trigram-based context-dependent spell checker therefore performed better than the bigram-based one. In this study, a back-off smoothing technique is used to look up unseen trigrams in a bigram language model when checking for context-dependent spelling errors. If the word is still not found in the bigram language model after back-off, the spell checker treats it as contextually invalid even if it is valid. Further research is therefore required to handle unseen words in the language model.
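
The detection and suggestion steps described in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation: the function names (`train`, `flag_errors`, `suggest`), the `<s>`/`</s>` padding tokens, and the `variants` mapping from a word to its alternative homophone spellings are all assumptions made for illustration. The sketch assumes a word is accepted when its trigram has been seen in training, backs off to the bigram otherwise, and flags the word only when both lookups fail.

```python
from collections import Counter


def train(sentences):
    """Count bigrams and trigrams over tokenized training sentences."""
    bigrams, trigrams = Counter(), Counter()
    for tokens in sentences:
        # Hypothetical padding tokens so the first words also have a context.
        padded = ["<s>", "<s>"] + list(tokens) + ["</s>"]
        bigrams.update(zip(padded[1:], padded[2:]))
        trigrams.update(zip(padded, padded[1:], padded[2:]))
    return bigrams, trigrams


def flag_errors(tokens, bigrams, trigrams):
    """Return indices of words whose context is unseen even after back-off."""
    padded = ["<s>", "<s>"] + list(tokens)
    flagged = []
    for i, word in enumerate(tokens):
        w1, w2 = padded[i], padded[i + 1]
        if trigrams[(w1, w2, word)] > 0:
            continue  # trigram seen in training: accept the word
        if bigrams[(w2, word)] > 0:
            continue  # back off to the bigram model: accept the word
        flagged.append(i)  # unseen in both models: treat as an error
    return flagged


def suggest(tokens, i, variants, bigrams, trigrams):
    """Rank alternative spellings of tokens[i] by how well they fit the context.

    `variants` is an assumed lookup table: word -> list of homophone spellings.
    """
    padded = ["<s>", "<s>"] + list(tokens)
    w1, w2 = padded[i], padded[i + 1]
    scored = [
        (trigrams[(w1, w2, cand)], bigrams[(w2, cand)], cand)
        for cand in variants.get(tokens[i], [])
    ]
    # Prefer candidates supported by a trigram, then by a bigram.
    return [cand for tri, bi, cand in sorted(scored, reverse=True) if tri or bi]
```

Because any word that is unseen in both models is flagged, this sketch shows the same limitation the abstract reports: a valid word that never appeared in the training data in that context is treated as contextually invalid.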
