AMHARIC NAMED ENTITY RECOGNITION [ANER]: USING SUPERVISED APPROACH

AGIZEW T/MARYAM, SOLOMON

URI: http://hdl.handle.net/123456789/559

Date: 2015-03-22

Abstract:

Identifying and classifying proper nouns in natural language text is the core task of most Named Entity Recognition (NER) systems. NER has received much attention because it forms a basic building block of any Information Extraction system. Although identifying and classifying proper nouns is challenging even in English, the task there benefits a great deal from the orthographic feature of capitalization. When this feature is uninformative, as in all-uppercase text or at the start of a sentence, ambiguity increases and more knowledge sources are needed to resolve it. Amharic script has no capitalization at all, so the NER task is immediately harder in Amharic than in English. The ambiguity is increased further because most Amharic proper nouns coincide in form with common nouns, verbs, and adjectives. A lookup approach that depends on proper noun dictionaries is therefore not an appropriate way to tackle the problem, since ambiguous tokens in this category are more likely to be used as non-proper nouns in text. In addition, Amharic is a morphologically rich language, which poses further challenges for NER. We assume that Amharic NER is closely bound to Part-of-Speech (POS) tagging. However, Amharic POS taggers normally have their worst accuracy on proper nouns. We therefore first built a POS tagger with good coverage that includes named entity classes, using a supervised approach, and obtained an accuracy of 79.0% with our customized set of 15 tags. We then used a filtering technique to help collect unique proper nouns from large gazetteers. Combining the POS, gazetteer, and unique-name features with a further set of features that we defined, we built a supervised NER classifier from IOB-format labelled data and obtained an F-measure of 94.2%.
Experiments on different datasets, against a baseline and with different combinations of features, demonstrated the effectiveness of our final proposed feature set. The unique names list helped in particular to offset the noise of the POS feature on proper nouns. Evaluation shows that the approach performs well for a first effort when compared with English NER, and that it is easy to deploy for practical use. Finally, we developed an ANER application prototype in the Python scripting language so that end users can interact with the system easily.
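
For readers unfamiliar with IOB labelling and the F-measure mentioned above, the following is a minimal illustrative sketch and not the thesis implementation. It assumes entity-level exact-match scoring (the abstract does not say whether scoring was done per entity or per token), and the tag names `PER` and `LOC` and the example sequences are hypothetical.

```python
def iob_to_spans(tags):
    """Convert IOB tags into a set of (start, end, type) spans; end is exclusive."""
    spans = set()
    start, etype = None, None
    # A trailing "O" sentinel closes any entity that runs to the end of the sentence.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, etype))
            start, etype = None, None
        # "B-" always opens an entity; a stray "I-" is treated leniently as an opener.
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, etype = i, tag[2:]
    return spans


def f_measure(gold_tags, pred_tags):
    """Entity-level F1: a predicted entity counts only if span and type both match."""
    gold, pred = iob_to_spans(gold_tags), iob_to_spans(pred_tags)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical five-token sentence: one person name and one two-token location.
gold = ["B-PER", "O", "B-LOC", "I-LOC", "O"]
# The prediction finds the person but truncates the location to one token.
pred = ["B-PER", "O", "B-LOC", "O", "O"]

score = f_measure(gold, pred)
print(f"F-measure: {score:.2f}")
```

Under exact-match scoring the truncated location counts as an error, so only one of the two entities is credited. A token-level score would be more forgiving of such boundary mistakes.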
