Automatic POS tagging of Arabic words using the YamCha machine-learning tool
Abstract
Part-of-speech (POS) tagging of Arabic words is the morphological labeling of each word with the part of speech appropriate to it. It is a basic step in most natural language processing (NLP) applications, such as automatic summarization, information retrieval, and machine translation. The aim of this research is to present an automatic Arabic POS tagger built on a statistical system that takes advantage of machine learning. The machine learning system used is YamCha (Yet Another Multipurpose CHunk Annotator), an open-source tool that performs many language processing tasks, such as automatic morphological tagging of words, named entity recognition, syntactic analysis of sentences, and other linguistic tasks. YamCha uses a machine learning algorithm called Support Vector Machines (SVMs), which classifies data accurately and efficiently after being trained on a portion of the data, and it allows the range and types of linguistic information used in learning to be varied (feature set and window size). The proposed methodology therefore requires a sufficiently large body of text annotated with part-of-speech tags on which to train the system. The corpus used in this research contained 100,039 words and was divided at a nominal ratio of 70% for training and 30% for testing: the training corpus contained 64,608 words and the test corpus 35,431 words. The system was trained on, and distinguishes between, 48 part-of-speech tags. It was trained on the training corpus several times, varying the range of linguistic information used in training each time; the test corpus was then tagged and the results evaluated in order to find the configuration that gives the best automatic tagging of Arabic words.
The lowest error rate was 11.4%, obtained when the previous word was included in the analysis without considering its part-of-speech tag (F:-1..0:0..).
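To make the feature setting F:-1..0:0.. concrete, the following is a minimal Python sketch of what such a window could select for each word, together with the error-rate calculation. It does not call YamCha itself. It assumes the notation reads as "rows -1 to 0 (previous and current word), columns 0 onward", with no previously assigned tags used as features. The function names, the `<PAD>` marker, and the feature-key format are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Sequence, Tuple

# One token = a tuple of feature columns; column 0 is assumed to be the word form.
Token = Tuple[str, ...]

PAD = "<PAD>"  # hypothetical marker for positions outside the sentence


def window_features(
    sentence: Sequence[Token],
    position: int,
    row_start: int = -1,
    row_end: int = 0,
    col_start: int = 0,
) -> Dict[str, str]:
    """Collect features from rows row_start..row_end, columns col_start onward.

    With the defaults this mirrors the assumed reading of F:-1..0:0..:
    the previous word and the current word, and no tag features.
    """
    feats: Dict[str, str] = {}
    for offset in range(row_start, row_end + 1):
        idx = position + offset
        if 0 <= idx < len(sentence):
            cols = sentence[idx][col_start:]
        else:
            cols = (PAD,)  # the window reaches past the sentence boundary
        for col, value in enumerate(cols, start=col_start):
            feats[f"F:{offset}:{col}"] = value
    return feats


def error_rate(gold: List[str], predicted: List[str]) -> float:
    """Fraction of words whose predicted tag differs from the gold tag."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equal in length")
    wrong = sum(g != p for g, p in zip(gold, predicted))
    return wrong / len(gold)
```

Each feature dictionary would then be paired with the gold tag of the current word and passed to a classifier. Widening `row_start` and `row_end`, or adding the previous word's tag as an extra feature, corresponds to the other configurations the abstract says were compared.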
Copyright (c) 2023 International Online Journal of Language, Communication, and Humanities
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.