Hierarchical Method for Automated Text Documents Classification
Digitalization is no longer a concept the world seeks to apply; it is the reality the world lives in. The transition toward a greener world has firmly established the principle of eliminating hard-copy resources while maintaining their digital versions. The immense amount of information residing in electronic documents has long been an open avenue for research, and information extraction, text mining, and Natural Language Processing (NLP) are three interconnected fields that have each earned a place in the digital world over time. This research introduces a novel method for Arabic document classification. The method assigns multiple tags to each document according to a set of criteria; one of these tags is the document's hierarchical classification, which could play an effective role in the document's field. For example, the large body of documents held in healthcare systems could lead to the discovery of a new symptom of a disease, since symptoms are known to change continuously over time. Through its generated schema, the proposed method relates old symptoms to new ones, so that their evolution comes as less of a surprise and there is an opportunity for advance preparation and successful containment. The technical challenges of this study include applying text mining techniques and machine learning successfully. A further, higher-level challenge is that the processing is applied to Arabic text documents, and Arabic is known to be a complex language with its own unique nature. The proposed method has been applied and compared with known methods, and its effectiveness has been confirmed on a classification task with an accuracy of 99.5%.
