Abstract:
All human languages have words that can mean different things in different contexts. Word sense disambiguation (WSD), an open problem in natural language processing, is the task of identifying which sense (i.e. meaning) of a word is used in a sentence when the word has multiple meanings (polysemy).
In this paper, we are concerned with a corpus-based approach to word sense disambiguation for Tigrigna texts that requires only information that can be automatically extracted from untagged text. We use unsupervised techniques to address the problem of automatically deciding the correct sense of an ambiguous word based on its surrounding context. Because of the lack of sufficient training data, we report experiments on only four selected ambiguous Tigrigna words: መደብ, read as “medeb”, which has three meanings (program, traditional bed, and grouping); ሓለፈ, read as “halefe”, which has four meanings (pass, promote, boss, and pass away); ሃደመ, read as “hademe”, which has two meanings (running and building a house); and ከበረ, read as “kebere”, which has two meanings (respecting and expensive).
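As a rough illustration of what "surrounding context" can mean in practice, the following minimal sketch counts the words that fall within a fixed window around each occurrence of an ambiguous word. The function name `context_features`, the window size, the whitespace tokenization, and the placeholder tokens `w1` to `w8` are all assumptions made for this example and are not taken from the experiments reported here.

```python
from collections import Counter


def context_features(tokens, target, window=3):
    """Count the words within `window` positions of each occurrence of `target`.

    Hypothetical helper: a simple bag-of-words context representation.
    """
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        # Collect every token in the window except the target occurrence itself.
        counts.update(t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i)
    return counts


# Placeholder tokens standing in for a transliterated sentence.
tokens = "w1 w2 w3 w4 medeb w5 w6 w7 w8".split()
features = context_features(tokens, "medeb", window=2)
print(sorted(features.items()))
```

Each sentence then yields one such count vector, which a clustering algorithm could group by similarity.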
For the purposes of this research, unsupervised machine learning techniques were applied to a corpus of Tigrigna sentences so as to acquire disambiguation information automatically. A total of 631 sense examples for the four ambiguous words, transliterated into Latin script, were collected from different online Tigrigna websites and newspapers.
Finally, we tested five clustering algorithms (simple k-means; hierarchical agglomerative clustering with single, average, and complete linkage; and expectation maximization) using their existing implementations in the Weka 3.8.1 package. The “Use training set” evaluation mode was selected to train the algorithms on the preprocessed dataset. We evaluated the algorithms on the four ambiguous words and achieved best accuracies in the range of 52 to 77.5% for simple k-means, 67 to 83.3% for expectation maximization (EM), 45.6 to 74.1% for single link, 65 to 73.3% for average link (AL), and 65 to 73.3% for complete link (CL), which is an encouraging result.
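The abstract does not spell out how the accuracy of a clustering was computed. One common convention, assumed here purely for illustration, is to label each cluster with its majority gold sense and count the instances that match that label. The sketch below implements this convention; the function name `cluster_accuracy` and the toy cluster assignments are hypothetical, and the computation is not necessarily the one performed by Weka.

```python
from collections import Counter, defaultdict


def cluster_accuracy(cluster_ids, gold_senses):
    """Fraction of instances whose gold sense equals their cluster's majority sense.

    Assumption: many-to-one mapping from clusters to senses (majority vote).
    """
    by_cluster = defaultdict(Counter)
    for cluster, sense in zip(cluster_ids, gold_senses):
        by_cluster[cluster][sense] += 1
    # For each cluster, the majority sense accounts for the "correct" instances.
    correct = sum(counts.most_common(1)[0][1] for counts in by_cluster.values())
    return correct / len(gold_senses)


# Toy data: eight instances of one ambiguous word assigned to two clusters.
clusters = [0, 0, 0, 1, 1, 1, 1, 0]
senses = ["program", "program", "bed", "bed", "bed", "grouping", "bed", "program"]
print(cluster_accuracy(clusters, senses))
```

Under this convention, a sense that never forms the majority of any cluster (here "grouping") is always counted as an error, which is one reason results can differ across evaluation schemes.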
The best accuracy, 67 to 83.3%, was achieved with the EM algorithm. However, we faced challenges in collecting datasets, stemming words properly, and transliterating the sentences into the SERA system, all of which must be overcome to obtain higher accuracy. Consequently, further experiments on other ambiguous words and with different approaches are needed to improve natural language understanding of the Tigrigna language.