Abstract:
A spelling checker is one of the natural language processing (NLP) applications that can detect and
provide possible suggestions for the incorrectly spelled words in a given text. There are two classes
of spelling errors: non-word and real-word spelling errors. Non-word errors are incorrectly typed
words that have no meaning and cannot be found in the dictionary of a specific language. Real-word
errors are valid words that have meaning but are contextually incorrect. It is common to see both
types of spelling errors (non-word and real-word) in written Ge’ez materials.
In this study, a context-dependent spelling error detector and corrector for the Ge’ez language is
designed and developed using the design science research methodology. The study focuses on the
problem of homophone Ge’ez characters being used incorrectly, so that a word takes on a completely
different meaning in a given sentence. The proposed context-dependent spell checker
architecture contains three main components: text preprocessing, error detection, and error
correction. An n-gram technique is applied to detect and correct context-dependent spelling errors.
Suggestions are selected from the possible alternative spellings of each word detected as
contextually incorrect.
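The detection and suggestion steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's implementation: the homophone groups list only first-order letter forms, the two-sentence corpus is toy data rather than the study's training set, and the names `HOMOPHONE_GROUPS`, `spelling_variants`, `train_bigrams`, and `check_sentence` are hypothetical.

```python
from collections import Counter
from itertools import product

# Assumption: homophone letter groups, first-order forms only, for brevity.
HOMOPHONE_GROUPS = ["ሀሐኀ", "ሰሠ", "አዐ", "ጸፀ"]
ALTERNATIVES = {ch: group for group in HOMOPHONE_GROUPS for ch in group}


def spelling_variants(word):
    """Return every way of writing `word` by swapping homophone letters.

    The result includes the word itself.
    """
    choices = [ALTERNATIVES.get(ch, ch) for ch in word]
    return {"".join(chars) for chars in product(*choices)}


def train_bigrams(sentences):
    """Count (previous word, word) pairs, with <s> marking the sentence start."""
    counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split()
        counts.update(zip(tokens, tokens[1:]))
    return counts


def check_sentence(sentence, bigrams):
    """Flag words whose left-context bigram is unseen and rank alternatives.

    Returns a list of (position, word, suggestions) tuples.
    """
    tokens = ["<s>"] + sentence.split()
    findings = []
    for i in range(1, len(tokens)):
        prev, word = tokens[i - 1], tokens[i]
        if bigrams[(prev, word)] > 0:
            continue  # the bigram was seen in training: accept the word
        candidates = [
            v for v in spelling_variants(word)
            if v != word and bigrams[(prev, v)] > 0
        ]
        candidates.sort(key=lambda v: bigrams[(prev, v)], reverse=True)
        findings.append((i - 1, word, candidates))
        if candidates:
            # Design choice of this sketch: continue with the top suggestion
            # so one error does not make the following bigram look unseen.
            tokens[i] = candidates[0]
    return findings


# Toy training data, not taken from the study's corpus.
model = train_bigrams(["ሠረቀ ፀሐይ", "ሠረቀ ፀሐይ"])
findings = check_sentence("ሰረቀ ፀሐይ", model)
```

In this toy run the first word is flagged, because its bigram with the sentence start was never seen, and its homophone variant that does occur in the training data is offered as the only suggestion. The second word is accepted once the top suggestion is substituted as its left context.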
The experiment is conducted using 16918 sentences for training and 650 sentences for testing. The
test data contains 6463 words, of which 461 are contextually incorrect words
(artificially inflected). The bigram-based and trigram-based experiments are carried out
independently, and the one with the better performance is recommended. In the bigram-based
experiment, the prototype achieved 99.79% lexical precision, 66.52% error precision, 96.23% lexical
recall, 97.39% error recall, and 96.31% accuracy. The trigram-based experiment achieved 99.94%
lexical precision, 67.65% error precision, 96.35% lexical recall, 99.34% error recall, and 96.56%
accuracy. The trigram-based context-dependent spell checker therefore outperformed the
bigram-based one.
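The five measures can be related to one another with a short calculation. The sketch below assumes the definitions commonly used in spell-checker evaluation, which the abstract itself does not spell out. The counts of flagged errors and false alarms are not stated in the source: they are back-calculated here from the reported percentages, and the function name `spell_checker_metrics` is hypothetical.

```python
def spell_checker_metrics(total_words, total_errors, flagged_errors, false_alarms):
    """Compute lexical/error precision and recall plus accuracy.

    flagged_errors: erroneous words the checker flagged (true detections).
    false_alarms:   valid words the checker flagged by mistake.
    """
    valid_words = total_words - total_errors
    accepted_valid = valid_words - false_alarms      # valid words left alone
    missed_errors = total_errors - flagged_errors    # errors left alone
    return {
        "lexical_precision": accepted_valid / (accepted_valid + missed_errors),
        "error_precision": flagged_errors / (flagged_errors + false_alarms),
        "lexical_recall": accepted_valid / valid_words,
        "error_recall": flagged_errors / total_errors,
        "accuracy": (accepted_valid + flagged_errors) / total_words,
    }


# 6463 test words and 461 errors come from the abstract; the remaining
# counts are assumptions inferred from the reported percentages.
bigram = spell_checker_metrics(6463, 461, flagged_errors=449, false_alarms=226)
trigram = spell_checker_metrics(6463, 461, flagged_errors=458, false_alarms=219)
```

Under these assumed counts, every computed value agrees with the corresponding reported figure to within 0.01 percentage points. This is consistent with the reading that the lower error precision comes from valid words being flagged rather than from missed errors.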
In this study, a back-off smoothing technique is used to look up unseen trigrams in a bigram
language model when checking for context-dependent spelling errors. However, if the word is still
not found in the bigram language model after backing off, the spell checker treats the word as
contextually invalid even if it is valid. Further research is therefore required to handle the
problem of unseen words in the language model.
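The back-off lookup and its limitation can be illustrated with a small sketch. It assumes a simple count-based check: a word is accepted if its trigram was seen, otherwise if its bigram was seen, and rejected if neither was. The tokens are placeholders rather than Ge’ez words, and the names `train` and `is_contextually_valid` are hypothetical.

```python
from collections import Counter


def train(sentences):
    """Build bigram and trigram counts, padding each sentence with <s>."""
    bigrams, trigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>", "<s>"] + sentence.split()
        bigrams.update(zip(tokens[1:], tokens[2:]))
        trigrams.update(zip(tokens, tokens[1:], tokens[2:]))
    return bigrams, trigrams


def is_contextually_valid(w1, w2, w3, bigrams, trigrams):
    """Check w3 in the context (w1, w2), backing off to (w2, w3) if needed."""
    if trigrams[(w1, w2, w3)] > 0:
        return True
    # Back off: the trigram is unseen, so consult the bigram model.
    return bigrams[(w2, w3)] > 0


# Placeholder training data (assumption, not from the study's corpus).
bigrams, trigrams = train(["a b c", "d b e"])

seen_trigram = is_contextually_valid("a", "b", "c", bigrams, trigrams)
rescued_by_backoff = is_contextually_valid("a", "b", "e", bigrams, trigrams)
unseen_everywhere = is_contextually_valid("a", "b", "z", bigrams, trigrams)
```

Here "a b e" never occurs as a trigram but is accepted because the bigram "b e" was seen. The word "z" is rejected because it appears in neither model, which is the situation described above: a valid word that is absent from the training data is reported as a contextual error.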