Abstract:
Identifying and classifying proper nouns in natural language text is fundamental to most
Named Entity Recognition (NER) systems. NER has received much attention because it is a
basic building block of any Information Extraction system. Although identifying and
classifying proper nouns is a challenging task in English, it benefits a great deal from the
orthographic feature of capitalization. When this feature is uninformative, as in
all-uppercase text or at the start of a sentence, ambiguity increases and more knowledge
sources are required to resolve it.
Amharic script, however, has no capitalization, so the NER task in Amharic is inherently
harder than in English. The resulting ambiguity is increased further because most Amharic
proper nouns share their forms with common nouns, verbs, and adjectives. A lookup approach
that depends on proper noun dictionaries is therefore not an appropriate way to tackle the
problem, as ambiguous tokens of this kind are more likely to be used as non-proper nouns in
text. In addition, Amharic is a morphologically rich language, which poses further
challenges for the NER task.
We assume that Amharic NER is closely tied to Part-of-Speech (POS) tagging. However,
Amharic POS taggers typically have their lowest accuracy on proper nouns. We therefore
first built a supervised POS tagger with good coverage whose tag set includes named entity
classes; it achieved an accuracy of 79.0% with our customized set of 15 tags. We then used a
filtering technique to collect unique proper nouns from large gazetteers. In addition to the
POS, gazetteer, and unique-name features, we defined a further set of features and used them
to build a supervised NER classifier from data labelled in IOB format, which achieved an
F-measure of 94.2%.
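
As an illustration of how an F-measure can be computed from IOB-labelled data, the following
is a minimal sketch and not the evaluation code used in this work. It assumes entity-level
scoring with exact span matching, which is one common convention; the abstract does not state
whether scoring was done per entity or per token. The function names and the PER and LOC tag
names are hypothetical.

```python
def iob_spans(tags):
    """Collect (start, end, type) entity spans from a sequence of IOB tags."""
    spans = set()
    start, etype = None, None
    # A trailing "O" acts as a sentinel that closes a span ending at the last token.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if not continues:
            if etype is not None:
                spans.add((start, i, etype))
            start, etype = None, None
            # "B-X" opens a span; a stray "I-X" is leniently treated as an opener too.
            if tag.startswith("B-") or tag.startswith("I-"):
                start, etype = i, tag[2:]
    return spans


def f_measure(gold_tags, pred_tags):
    """Entity-level F1: a predicted span counts only if boundaries and type both match."""
    gold, pred = iob_spans(gold_tags), iob_spans(pred_tags)
    true_pos = len(gold & pred)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the person span matches, the location span is misplaced.
gold = ["B-PER", "I-PER", "O", "B-LOC", "O"]
pred = ["B-PER", "I-PER", "O", "O", "B-LOC"]
score = f_measure(gold, pred)  # precision 1/2, recall 1/2
```

Under exact matching, a prediction with the right type but the wrong boundaries counts as
both a false positive and a false negative, so this scoring is stricter than per-token
accuracy.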
Experiments on different datasets, against a baseline and with different combinations of
features, demonstrated the effectiveness of our final set of proposed features. The unique
names list, in particular, helped to compensate for the noise of the POS feature on proper
nouns. Evaluation shows that our approach performs well for a first effort when compared
with English NER, and that it is easy to deploy for practical use.
Finally, we developed a prototype ANER application in the Python scripting language so that
end users can interact with our system easily.