Received date July 25, 2021

Accepted date December 13, 2021

Multi-Lingual Language Variety Identification using Conventional Deep Learning and Transfer Learning Approaches

Author Sameeah Noreen Hameed, Muhammad Adnan Ashraf, Qiao Ya-nan

Keywords #Language variety identification #deep learning #transfer learning #binary classification

Abstract Language variety identification aims to identify lexical and semantic variations across different varieties of a single language. It helps build the linguistic profile of an author from written text, which can be used for cyber forensics and marketing purposes. Among previous efforts on language variety identification, we found hardly any study that experiments with transfer learning approaches or performs a thorough comparison of different deep learning approaches on a range of benchmark datasets. To bridge this gap, we propose transfer learning approaches for language variety identification and compare them extensively with deep learning approaches on multiple varieties of four widely spoken languages: Arabic, English, Portuguese, and Spanish. We treat the task as a binary classification problem for Portuguese and as a multi-class classification problem for Arabic, English, and Spanish. We applied two transfer learning approaches, Bidirectional Encoder Representations from Transformers (BERT) and Universal Language Model Fine-tuning (ULMFiT); three deep learning approaches, Convolutional Neural Networks (CNN), Bidirectional Long Short-Term Memory (Bi-LSTM), and Gated Recurrent Units (GRU); and an ensemble approach for identifying the different varieties. A thorough comparison suggests that the transfer learning based ULMFiT model outperforms all other approaches and achieves the best accuracy on both the binary and multi-class language variety identification tasks.
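The abstract does not say how the ensemble combines its member models, so the following is only a minimal sketch under the assumption of hard majority voting over per-model label predictions, scored with plain accuracy. The function names, the variety labels (`pt-BR`, `pt-PT`), and the example predictions are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine label predictions from several models by hard majority voting.

    predictions_per_model: list of lists, one inner list per model, each
    holding one predicted label per document (all lists the same length).
    Ties go to the label predicted by the earliest model in the list.
    This is an assumed combination rule, not the paper's stated method.
    """
    n_docs = len(predictions_per_model[0])
    combined = []
    for i in range(n_docs):
        votes = [preds[i] for preds in predictions_per_model]
        counts = Counter(votes)
        top = max(counts.values())
        # Pick the first vote (in model order) that reaches the top count.
        combined.append(next(v for v in votes if counts[v] == top))
    return combined


def accuracy(predicted, gold):
    """Fraction of documents whose predicted label matches the gold label."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Hypothetical binary task (two Portuguese varieties) with made-up outputs.
gold = ["pt-BR", "pt-PT", "pt-BR", "pt-PT"]
cnn = ["pt-BR", "pt-BR", "pt-BR", "pt-PT"]
bilstm = ["pt-BR", "pt-PT", "pt-PT", "pt-PT"]
gru = ["pt-PT", "pt-PT", "pt-BR", "pt-PT"]

ensemble = majority_vote([cnn, bilstm, gru])
print(accuracy(cnn, gold), accuracy(ensemble, gold))
```

The same two functions work unchanged for the multi-class setting, since voting and accuracy only compare label values and do not depend on how many varieties there are.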

The International Arab Journal of Information Technology, Vol. 19, No. 5, September 2022

Sameeah Noreen Hameed is an instructor at the East China Jiaotong University, China. She received her Master’s degree in Computer Science from Xi’an Jiaotong University (XJTU), China. Her current research interests include NLP, information retrieval, and GIS.

Muhammad Adnan Ashraf is a Ph.D. scholar at Northwestern Polytechnical University, China. He is also working as a lecturer in Computer Science at COMSATS University Islamabad, Pakistan. His current research is in NLP and information retrieval.

Qiao Ya-nan is an associate professor at XJTU. He received his Ph.D. degree in Computer Science from Xi’an Jiaotong University, China. His current research is in blockchain, cloud computing, information retrieval, text mining, and social network analysis. His research has been financially supported by the National Key R&D Program of China, the National Natural Science Foundation of China, etc.