Performance Evaluation of Keyword Extraction Techniques and Stop Word Lists on Speech-To-Text Corpus

Authors Blessed Guda, Bello Kontagora Nuhu, James Agajo, Ibrahim Aliyu

Keywords #natural language processing #RAKE #textrank #stoplist #speech recognition

Abstract

The era of conversational user interfaces, in which humans communicate with computers by voice, has arrived. Natural Language Processing (NLP) techniques must therefore handle not only typed text but also speech. Keyword extraction identifies the key phrases of a document; these phrases can summarize the document and be used in text classification. Existing keyword extraction techniques have mostly been applied to typed text datasets. Text produced by speech recognition engines is less accurate than typed text, so the suitability of these techniques for such text is open to question. This paper evaluates the suitability of conventional keyword extraction methods on a speech-to-text corpus. A new audio dataset for keyword extraction is collected using the World Wide Web (WWW) corpus. The performance of Rapid Automatic Keyword Extraction (RAKE) and TextRank is evaluated with different stoplists on both the originally typed corpus and the corresponding Speech-To-Text (STT) corpus obtained from the audio. Precision, recall, and F1 score were used as the evaluation metrics. TextRank with the FOX stoplist showed the highest performance on both the text and audio corpora, with F1 scores of 16.59% and 14.22%, respectively. Although it lags behind the score on the text corpus, the F1 score of TextRank on the audio corpus is sufficient to support its adoption for spoken conversation without much concern. However, the absence of punctuation in the STT output affected the F1 score of all the techniques.
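
The evaluation idea described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: it only mimics RAKE-style candidate generation (splitting at punctuation and stop words, without RAKE's scoring step) and computes precision, recall, and F1 against a reference keyword set. The tiny stoplist, the two sample sentences, the gold keywords, and the function names `candidate_phrases` and `precision_recall_f1` are all hypothetical and chosen purely for illustration.

```python
import re

# Hypothetical toy stoplist; real stoplists such as FOX are far larger.
STOPLIST = {"a", "an", "and", "are", "for", "in", "is", "of", "on",
            "than", "the", "to", "with"}


def candidate_phrases(text, stoplist=STOPLIST):
    """Split text into candidate phrases at punctuation and stop words."""
    phrases = []
    # Punctuation marks act as hard phrase boundaries.
    for fragment in re.split(r"[.,;:!?()]", text.lower()):
        current = []
        for word in re.findall(r"[a-z0-9'-]+", fragment):
            if word in stoplist:
                # A stop word closes the phrase collected so far.
                if current:
                    phrases.append(" ".join(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(" ".join(current))
    return phrases


def precision_recall_f1(extracted, gold):
    """Exact-match precision, recall, and F1 between two keyword collections."""
    extracted, gold = set(extracted), set(gold)
    matches = len(extracted & gold)
    precision = matches / len(extracted) if extracted else 0.0
    recall = matches / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical sample: the same content as typed text and as unpunctuated
# speech-to-text output.
typed = ("Keyword extraction is useful for text classification. "
         "Speech recognition is less accurate than typing.")
stt = ("keyword extraction is useful for text classification "
       "speech recognition is less accurate than typing")
gold = ["keyword extraction", "text classification", "speech recognition"]

for name, text in (("typed", typed), ("stt", stt)):
    phrases = candidate_phrases(text)
    p, r, f1 = precision_recall_f1(phrases, gold)
    print(name, phrases, f"P={p:.2f} R={r:.2f} F1={f1:.2f}")
```

In this toy sample, the missing full stop in the unpunctuated version merges "text classification" and "speech recognition" into one candidate phrase, so neither matches the gold keywords and the F1 score drops. This is one plausible way in which absent punctuation can affect scores for a method that relies on phrase boundaries.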

The International Arab Journal of Information Technology, Vol. 20, No. 1, January 2023