The International Arab Journal of Information Technology (IAJIT)


A Topic-Specific Web Crawler using Deep Convolutional Networks

This paper presented a new focused crawler that efficiently supports the Turkish language. The developed architecture was divided into multiple units: a control unit, a crawler unit, a link extractor unit, a link sorter unit, and a natural language processing unit. The crawler's units can work in parallel to process the massive number of published websites. In addition, the proposed Convolutional Neural Network (CNN) based natural language processing unit can effectively classify Turkish text and web pages. Extensive experiments using three datasets were performed to illustrate the performance of the developed approach. The first dataset contains 50,000 Turkish web pages downloaded by the developed crawler, while the other two are publicly available and consist of 28,567 and 22,431 Turkish web pages, respectively. Furthermore, Vector Space Model (VSM) representations in general, and state-of-the-art word embedding techniques in particular, were investigated to find the most suitable one for the Turkish language. Overall, the results indicated that the developed approach achieved good performance, robustness, and stability when processing the Turkish language. Bidirectional Encoder Representations from Transformers (BERT) was found to be the most appropriate embedding for building an efficient Turkish language classification system. Finally, our experiments showed that the developed natural language processing unit outperformed seven state-of-the-art CNN classification systems, with an accuracy improvement of 10% over the second-best system and 47% over the lowest-performing one.
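The way the units described in the abstract could fit together can be illustrated with a minimal Python sketch. It is not the authors' implementation. The keyword-based `keyword_relevance` function is a hypothetical stand-in for the CNN-based natural language processing unit, `TOY_WEB` is an invented in-memory set of pages used in place of real downloads, and the loop runs sequentially rather than in parallel. All names, the threshold, and the rule that outlinks inherit the parent page's score are assumptions made for illustration.

```python
import heapq
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Link extractor unit (sketch): collects href targets and visible text."""

    def __init__(self):
        super().__init__()
        self.links, self.text = [], []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)

    def handle_data(self, data):
        self.text.append(data)


def keyword_relevance(text, topic_words):
    """Hypothetical stand-in for the NLP unit: share of tokens on topic."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(t in topic_words for t in tokens) / len(tokens)


def focused_crawl(seed, fetch, topic_words, threshold=0.2, max_pages=10):
    """Control unit (sketch): fetch the best-scored URL, classify, enqueue links."""
    frontier = [(-1.0, seed)]  # heapq is a min-heap, so priorities are negated
    seen, relevant, fetched = {seed}, [], 0
    while frontier and fetched < max_pages:
        _, url = heapq.heappop(frontier)
        html = fetch(url)  # crawler unit: any callable returning HTML or None
        fetched += 1
        if html is None:
            continue
        parser = LinkExtractor()
        parser.feed(html)
        score = keyword_relevance(" ".join(parser.text), topic_words)
        if score >= threshold:
            relevant.append(url)
        # Link sorter unit (assumed rule): outlinks inherit the parent's score.
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, link))
    return relevant


# Invented toy pages; keys play the role of URLs.
TOY_WEB = {
    "a": '<a href="b">spor</a> <a href="c">hava</a> futbol gol haberleri',
    "b": 'futbol gol <a href="d">lig</a>',
    "c": "hava durumu bulut",
    "d": "lig futbol puan durumu",
}
TOPIC = {"futbol", "gol", "lig", "spor"}

if __name__ == "__main__":
    print(focused_crawl("a", TOY_WEB.get, TOPIC))
```

In this sketch the frontier is a priority queue, so links found on highly relevant pages are visited before links found on off-topic pages. Replacing `keyword_relevance` with a trained classifier would not change the control loop.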

 


The International Arab Journal of Information Technology, Vol. 20, No. 3, May 2023