Abstract:
Music is a social language that binds communities together. In Ethiopia, music began as a means of religious expression with the advent of the Ethiopian Orthodox Church, and it reflects diverse cultural, historical, and social episodes. It continues to evolve with advances in technology. Cultural-style music classification is an important branch of music information retrieval that deserves in-depth study, since automatic music categorization has practical value. In this work, we propose an automatic bi-modal classification model for traditional Ethiopian music using deep learning techniques applied to different data modalities. Experiments are also carried out on single-modality music genre classification and on the combination of modalities. The results of both show how combining learned representations from diverse modalities improves classification accuracy.
However, the lack of a labeled dataset is a major obstacle. We therefore begin by developing a music video dataset that captures differences in region, language, culture, and musical instruments. We train and test two unimodal models and one bimodal model on this dataset. First, we separately train a unimodal CNN-based music classifier with a parallel RNN on audio spectrogram input. Then we train the video model with a 3D-CNN on frames extracted from the music videos, using no more than 27 frames per video to capture its temporal continuity.
A comparative analysis of each unimodal classifier across various optimizers is carried out to find the best model to integrate into a multimodal structure. The best unimodal models are then combined, integrating the corresponding audio and video network features in a bimodal classifier. The bi-modal structure incorporates all audio and video features and uses a late feature fusion strategy to classify them with a softmax classifier. To generate the overall prediction, all candidate uni-modal structures are merged into a single predictive model.
Evaluation results using various metrics show that the bimodal architectures outperform each unimodal music classifier. The predictive model obtained by integrating all multimodal structures achieves 97.9% accuracy, a 90% F1-score, and an area under the curve (AUC) score of 94.7%.
Keywords: Music classification, bi-modal, uni-modal, spectrograms, frame, late feature fusion,
3D-CNN
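
For illustration, the following is a minimal sketch of the late feature fusion bi-modal architecture summarized above, assuming a TensorFlow/Keras implementation. The input shapes (128x128 mel-spectrograms, clips of 27 RGB frames at 112x112), layer sizes, and the number of genre classes are illustrative assumptions, not the exact configuration used in this work.

```python
# Sketch of a bi-modal (audio + video) genre classifier with late feature fusion.
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 5  # assumed number of traditional-music genre classes

# Audio branch: 2D CNN over a mel-spectrogram image.
audio_in = layers.Input(shape=(128, 128, 1), name="spectrogram")
x = layers.Conv2D(32, 3, activation="relu")(audio_in)
x = layers.MaxPooling2D()(x)
x = layers.Conv2D(64, 3, activation="relu")(x)
x = layers.GlobalAveragePooling2D()(x)
audio_feat = layers.Dense(128, activation="relu")(x)

# Video branch: 3D CNN over a clip of at most 27 frames.
video_in = layers.Input(shape=(27, 112, 112, 3), name="frames")
y = layers.Conv3D(16, 3, activation="relu")(video_in)
y = layers.MaxPooling3D()(y)
y = layers.Conv3D(32, 3, activation="relu")(y)
y = layers.GlobalAveragePooling3D()(y)
video_feat = layers.Dense(128, activation="relu")(y)

# Late feature fusion: concatenate branch features, then a softmax classifier.
fused = layers.Concatenate()([audio_feat, video_feat])
out = layers.Dense(NUM_CLASSES, activation="softmax")(fused)

model = models.Model(inputs=[audio_in, video_in], outputs=out)
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```

In this sketch, each branch is trained (or pre-trained) as a unimodal classifier and the fusion happens only at the feature level, after each network has produced its own representation, which is what distinguishes late fusion from feeding raw modalities into a single network.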