Feature-rich PoS Tagging through Taggers Combination: Experience in Arabic

Imad Zeroual, Abdelhak Lakhouaja

Abstract


Because a word can play different syntactic roles in different contexts, assigning the appropriate morphosyntactic category to each word according to its context is not trivial. Part of Speech (PoS) tagging is the task that addresses this issue. Several probabilistic methods have been adapted for PoS tagging, such as Hidden Markov Models, Support Vector Machines, and Decision Trees. Based on these methods, language-independent PoS taggers have been developed, such as TnT, SVMTool, and Treetagger. The main purpose of this work is to automatically combine the output of these standard PoS taggers and to investigate several options for performing this combination. The experiments are applied to Arabic, a morphologically complex language. In this paper, we highlight the use of these taggers through various experiments. The evaluations involve several tests on both Classical and Modern Standard Arabic, on trained and untrained data, and on tagged and untagged data. Finally, we carry out a deeper investigation of Arabic PoS tagging through the combination of these language-independent taggers.
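
The abstract does not say which combination schemes were investigated. As an illustration only, one common option is per-token majority voting, sketched below. The function name `combine_by_majority`, the `tie_breaker` parameter, and the sample tags are hypothetical and are not taken from the paper or from the TnT, SVMTool, or Treetagger interfaces; each tagger's output is assumed to have already been read into a plain list of tags aligned on the same tokens.

```python
from collections import Counter


def combine_by_majority(tagger_outputs, tie_breaker=0):
    """Combine aligned tag sequences from several taggers by majority vote.

    tagger_outputs: list of tag lists, one per tagger, all of equal length.
    tie_breaker: index of the tagger whose tag is used when the top vote
                 count is shared by more than one tag (an assumption made
                 for this sketch, not a rule from the paper).
    """
    lengths = {len(seq) for seq in tagger_outputs}
    if len(lengths) != 1:
        raise ValueError("all taggers must tag the same token sequence")

    combined = []
    for votes in zip(*tagger_outputs):
        ranked = Counter(votes).most_common()
        top_tag, top_count = ranked[0]
        # A tie exists when the second-ranked tag has the same count.
        if len(ranked) > 1 and ranked[1][1] == top_count:
            combined.append(votes[tie_breaker])
        else:
            combined.append(top_tag)
    return combined


# Hypothetical tags for four tokens, one list per tagger.
tnt_tags = ["NOUN", "VERB", "PREP", "ADJ"]
svmtool_tags = ["NOUN", "NOUN", "PREP", "NOUN"]
treetagger_tags = ["ADJ", "VERB", "NOUN", "VERB"]

combined = combine_by_majority([tnt_tags, svmtool_tags, treetagger_tags])
print(combined)
```

On the last token all three taggers disagree, so the sketch falls back to the first tagger's tag. Other schemes, such as weighting each vote by the tagger's measured accuracy, would replace only the counting step.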


Keywords


Part of speech, Tagging, Treetagger, SVMTool, TnT, Arabic.

DOI: http://dx.doi.org/10.14738/tmlai.54.2981


______________________________________________________________________________

Transactions on Machine Learning and Artificial Intelligence; ISSN (online) 2054-7309

Copyright Society for Science and Education, United Kingdom