The International Arab Journal of Information Technology (IAJIT)



A New Approach to Automatically Find and Fix Erroneous Labels in Dependency ...

Dependency Parsing (DP) determines, for each sentence in a text, the sub-term/upper-term (dependent/head) relations between the words that make up the sentence. DP produces meaningful information for high-level applications. Correct labeling of the text corpus used in DP studies is therefore very important: studies carried out on a wrongly labeled corpus will produce erroneous results. Whether a text corpus is labeled manually by human beings or automatically, faulty cases will occur. Owing to human factors or to the annotations used for labeling, treebanks will contain faulty labels. Detecting and correcting possible faulty labels is therefore very important for increasing the accuracy of subsequent studies, but correcting them manually requires great effort and time. The purpose of this study is to create a model that automatically finds possible faulty labels and suggests new labels for them. The proposed model aims to detect and correct possible faulty labels in a text corpus and to increase consistency among the text corpora of the same language. The new labels that the model suggests for faulty labels are a great convenience for the language expert. Another advantage is that the model has a language-independent structure. In experimental studies for Turkish, the model obtained successful results in finding and correcting potentially faulty labels. An increase in accuracy was also observed in studies carried out for languages other than Turkish. To assess the accuracy of the system's output, the results were analyzed with the help of 10 language experts.


The International Arab Journal of Information Technology, Vol. 18, No. 3, May 2021

Metin Bilgin received the Ph.D. degree in Computer Engineering from Yıldız Technical University in 2015. He also carried out postdoctoral research in the Computational Linguistics department at Uppsala University for about 10 months. He is currently an assistant professor in the Department of Computer Engineering, Bursa Uludağ University, Turkey. His current research interests include machine learning, natural language processing, and text classification.