Abstract:
Amharic, one of the Semitic languages, is the working language of the Federal Democratic Republic of Ethiopia. Amharic words are often ambiguous with respect to part of speech. Part-of-speech (POS) tagging resolves this ambiguity by assigning the appropriate part of speech to each word based on its morphological information and syntactic context. It is therefore an important component of many natural language processing (NLP) applications. Many POS tagging tools are based on supervised machine learning algorithms, which need manually annotated training data; this is a major obstacle for under-resourced languages like Amharic. In this study, therefore, an unsupervised POS tagger is applied to Amharic. Because it requires only an unlabeled collection of texts, this approach avoids the need for a large manually annotated data set.
This research used a mixed (quantitative and qualitative) research design. A training data set of 929,526 sentences was collected using a web crawler [10] and from the sources in [2, 56]. Preprocessing was done in Python 3, and the unsupos tagger was applied both before and after normalizing the cleaned data set. The features unsupos uses to represent each word are frequency, distributional similarity, neighboring co-occurrence and morphological information. Finally, recall, precision and accuracy were used to evaluate the output of the unsupos tagger.
Based on the experimental results, an accuracy of 66.98% was obtained using 37 sentences of the test data set and 70.25% using 44 sentences of the test data set, considering 11 high-level tags. This is a promising result, given that the training data contains many infrequent words. In addition, although fine-grained tags could not be evaluated because no test data annotated with fine-grained tags was available, the tagger identified many of them.
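
The evaluation measures named above (accuracy, precision and recall) can be illustrated with a minimal sketch. It is not the evaluation script used in this study. It assumes that the tagger's output has already been mapped to the same high-level tag set as the gold annotation and that both are given as token-aligned lists. The function name `evaluate_tags` and the example tags are hypothetical.

```python
from collections import Counter


def evaluate_tags(gold, predicted):
    """Return overall accuracy and per-tag (precision, recall).

    Assumption: `gold` and `predicted` are token-aligned lists of tag
    strings drawn from the same tag set (hypothetical input format).
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot evaluate an empty tag sequence")

    pairs = list(zip(gold, predicted))

    # Accuracy: share of tokens whose predicted tag equals the gold tag.
    accuracy = sum(g == p for g, p in pairs) / len(pairs)

    true_pos = Counter(g for g, p in pairs if g == p)
    gold_counts = Counter(gold)
    pred_counts = Counter(predicted)

    per_tag = {}
    for tag in set(gold_counts) | set(pred_counts):
        tp = true_pos[tag]
        # Precision: correct predictions of `tag` / all predictions of `tag`.
        precision = tp / pred_counts[tag] if pred_counts[tag] else 0.0
        # Recall: correct predictions of `tag` / all gold occurrences of `tag`.
        recall = tp / gold_counts[tag] if gold_counts[tag] else 0.0
        per_tag[tag] = (precision, recall)
    return accuracy, per_tag


# Illustrative tags only; they are not taken from the study's tag set.
gold = ["N", "V", "N", "ADJ"]
predicted = ["N", "N", "N", "ADJ"]
accuracy, per_tag = evaluate_tags(gold, predicted)
```

Reporting precision and recall per tag, alongside overall accuracy, shows which tags the tagger confuses even when the overall score looks acceptable.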