The International Arab Journal of Information Technology (IAJIT)



A New Vector Representation of Short Texts for Classification

Shortness and sparsity, together with synonymy and homonymy, are the main obstacles to short-text classification. In recent years, research on short-text classification has focused on expanding short texts, but it rarely guarantees the validity of the expanded words. This study proposes a new method that weakens these effects without external knowledge. The proposed method analyses short texts with a topic model based on Latent Dirichlet Allocation (LDA), represents each short text with a vector space model, and introduces a new scheme for adjusting the short-text vectors. In the experiments, two open short-text data sets, composed of Google news and web search snippets, are utilised to evaluate the classification performance and demonstrate the effectiveness of our method.
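As a rough illustration of the general pipeline described above, the following Python sketch builds a TF-IDF vector space representation, infers LDA topic mixtures with scikit-learn, and concatenates the two views before feeding a standard classifier. The tiny corpus, the number of topics, the labels, and the concatenation step are all illustrative assumptions; the paper's own vector-adjustment scheme is not reproduced here.

    # Minimal sketch: LDA topic features combined with a vector space model.
    # Corpus, labels, topic count, and plain concatenation are assumptions,
    # not the adjustment method proposed in the paper.
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.linear_model import LogisticRegression

    docs = ["cheap flights to new york", "best laptop deals today",
            "new york city weather", "laptop battery not charging"]
    labels = [0, 1, 0, 1]  # hypothetical class labels for the snippets

    # Vector space model: TF-IDF term weights for each short text.
    tfidf = TfidfVectorizer()
    X_tfidf = tfidf.fit_transform(docs).toarray()

    # Topic model: LDA over raw term counts, giving a per-document topic mixture.
    counts = CountVectorizer()
    X_counts = counts.fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    X_topics = lda.fit_transform(X_counts)

    # Adjusted representation: here simply the concatenation of both views.
    X = np.hstack([X_tfidf, X_topics])

    clf = LogisticRegression().fit(X, labels)
    print(clf.predict(X))

The sketch only shows where topic information enters the representation; in the proposed method the term vector is adjusted using the topic model rather than merely concatenated with it.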


Yangyang Li received a master's degree in computer science and technology from Jinan University in 2019. Her main research interests are data mining, machine learning, and natural language processing.

Bo Liu received her master's degree in computer application from Central South University in 1991. She is now a professor at Jinan University. Her research interests include data mining, information retrieval, and natural language processing.