The International Arab Journal of Information Technology (IAJIT)

Multimodal Human Interaction Recognition Framework Using Multi-Features and Deep Learning Approach

Human Interaction Recognition (HIR) is one of the most important research topics in computer vision and pattern recognition. It deals with identifying specific interactions in static images and faces several challenges related to the lack of temporal data, feature extraction, variability in image conditions, and the need for more accurate, interpretable, and robust models. Current approaches struggle to exploit the potential of static images for interaction recognition, leaving a shortage of effective algorithms for these resources; addressing these gaps could lead to great strides in the field. This paper aims to fill this gap by presenting a new Convolutional Neural Network (CNN)-based deep learning framework for interaction recognition that integrates multimodal data for enhanced performance. The methodology proceeds as follows: images are preprocessed using the Hue Saturation Value (HSV) color transformation to improve image quality, and silhouettes are extracted using Multiple Object Tracking (MOT) and the Visual Background Subtractor (ViBe) technique. We then employ two distinct feature extraction approaches: Texton maps for full-body features and geometric attributes for skeleton features. The extracted features are efficiently discriminated using Quadratic Discriminant Analysis (QDA). Evaluation of the proposed framework yields a recognition rate of 90.2% on the ShakeFive2 dataset and an accuracy of 92.3% on the University of Lincoln (UoL) dataset, outperforming baseline models such as traditional handcrafted-feature methods. These results show that the proposed method is an effective solution for human interaction recognition from static images. This research advances state-of-the-art deep learning algorithms for human interaction recognition, with applications in human-computer interaction, video analysis, and surveillance, and thus contributes to the field of computer vision.
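To make the described pipeline concrete, the sketch below outlines the processing chain in Python using OpenCV and scikit-learn. It is a minimal illustration under stated assumptions, not the authors' implementation: OpenCV does not ship ViBe, so its MOG2 background subtractor stands in for the MOT+ViBe silhouette step, and the Texton-map and geometric skeleton descriptors are replaced by a hypothetical masked HSV-histogram placeholder; all function names and parameter values here are illustrative only.

```python
# Illustrative sketch of the abstract's pipeline (not the paper's actual code):
# HSV preprocessing -> silhouette extraction -> feature vector -> QDA classifier.
import cv2
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

# Stand-in for ViBe: OpenCV provides MOG2, not ViBe, so we approximate.
bg_subtractor = cv2.createBackgroundSubtractorMOG2(history=200, detectShadows=False)

def preprocess_hsv(frame_bgr):
    """HSV color transformation, as in the preprocessing step."""
    return cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)

def extract_silhouette(frame_bgr):
    """Binary foreground mask as a proxy for the MOT + ViBe silhouette step."""
    mask = bg_subtractor.apply(frame_bgr)
    return cv2.threshold(mask, 127, 255, cv2.THRESH_BINARY)[1]

def extract_features(hsv_img, silhouette):
    """Hypothetical placeholder for the Texton-map / geometric features:
    a silhouette-masked 2D HSV histogram, flattened to a fixed-length vector."""
    hist = cv2.calcHist([hsv_img], [0, 1], silhouette, [16, 16], [0, 180, 0, 256])
    return cv2.normalize(hist, None).flatten()

def train_qda(feature_matrix, labels):
    """QDA discriminates the extracted feature vectors by interaction class."""
    clf = QuadraticDiscriminantAnalysis()
    clf.fit(np.asarray(feature_matrix), np.asarray(labels))
    return clf
```

In the actual framework, the placeholder extractor would be replaced by the Texton-map full-body features and the geometric skeleton attributes before QDA is fit.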

 

[1] Al Mokhtar Z. and Dawwd S., “3D VAE Video Prediction Model with Kullback Leibler Loss Enhancement,” The International Arab Journal of Information Technology, vol. 21, no. 5, pp. 879-888, 2024. https://doi.org/10.34028/iajit/21/5/9

[2] Alonazi M., Ansar H., Al Mudawi N., Alotaibi S., et al., “Smart Healthcare Hand Gesture Recognition Using CNN-Based Detector and Deep Belief Network,” IEEE Access, vol. 11, pp. 84922-84933, 2023. https://doi.org/10.1109/ACCESS.2023.3289389

[3] Barnich O. and Van Droogenbroeck M., “ViBe: A Universal Background Subtraction Algorithm for Video Sequences,” IEEE Transactions on Image Processing, vol. 20, no. 6, pp. 1709-1724, 2011. https://doi.org/10.1109/TIP.2010.2101613

[4] Chaaraoui A., Perez P., and Revuelta F., “A Review on Vision Techniques Applied to Human Behaviour Analysis for Ambient-Assisted Living,” Expert Systems with Applications, vol. 39, pp. 10873-10888, 2012. https://doi.org/10.1016/j.eswa.2012.03.005

[5] Cheng X. and Zhang P., “Enhanced Soccer Training Simulation Using Progressive Wasserstein GAN and Termite Life Cycle Optimization in Virtual Reality,” The International Arab Journal of Information Technology, vol. 21, no. 4, pp. 549-559, 2024. https://doi.org/10.34028/iajit/21/4/1

[6] Coppola C., Cosar S., Faria D., and Bellotto N., “Automatic Detection of Human Interactions from RGB-D Data for Social Activity Classification,” in Proceedings of the 26th IEEE International Symposium on Robot and Human Interactive Communication, Lisbon, pp. 871-876, 2017. https://doi.org/10.1109/ROMAN.2017.8172405

[7] Dua N., Singh S., and Semwal V., “Multi-Input CNN-GRU Based Human Activity Recognition Using Wearable Sensors,” Computing, vol. 103, pp. 1461-1478, 2021. https://doi.org/10.1007/s00607-021-00928-8

[8] Feudo S., Dion J., Renaud F., Kerschen G., and Noel J., “Video Analysis of Nonlinear Systems with Extended Kalman Filtering for Modal Identification,” Nonlinear Dynamics, vol. 111, pp. 13263-13277, 2023. https://doi.org/10.1007/s11071-023-08560-1

[9] Garcia S., Baena C., and Salcedo A., “Human Activities Recognition Using Semi-Supervised SVM and Hidden Markov Models,” TecnoLogicas, vol. 26, no. 56, pp. 1-19, 2023. https://doi.org/10.22430/22565337.2474

[10] Gemeren C., Poppe R., and Veltkamp R., “Hands-on: Deformable Pose and Motion Models for Spatiotemporal Localization of Fine-Grained Dyadic Interactions,” EURASIP Journal on Image and Video Processing, vol. 2018, no. 16, pp. 1-16, 2018. https://doi.org/10.1186/s13640-018-0255-0

[11] Gemeren C., Poppe R., and Veltkamp R., DPM Configurations for Human Interaction Detection, Utrecht University, 2016. https://webspace.science.uu.nl/~veltk101/publications/art/nccv2015_p35L.pdf

[12] Hasan R. and Alani N., “A Comparative Analysis Using Silhouette Extraction Methods for Dynamic Objects in Monocular Vision,” Cloud Computing and Data Science, vol. 1, pp. 1-12, 2022. https://www.researchgate.net/publication/357860149_A_Comparative_Analysis_Using_Silhouette_Extraction_Methods_for_Dynamic_Objects_in_Monocular_Vision

[13] He L., Jiang D., Yang L., Pei E., et al., “Multimodal Affective Dimension Prediction Using Deep Bidirectional Long Short-Term Memory Recurrent Neural Networks,” in Proceedings of the 5th International Workshop on Audio/Visual Emotion Challenge, Brisbane, pp. 73-80, 2015. https://doi.org/10.1145/2808196.2811641

[14] Hua G., Kumar H., Aradhya M., and Maheshan M., “A New Effective Speed and Distance Feature Descriptor Based on Optical Flow Approach in HAR,” Revue D’Intelligence Artificielle, vol. 37, pp. 109-115, 2023. http://dx.doi.org/10.18280/ria.370114

[15] Jalal A., Kim Y., Kim Y., Kamal S., et al., “Robust Human Activity Recognition from Depth Video Using Spatiotemporal Multi-Fused Features,” Pattern Recognition, vol. 61, pp. 295-308, 2017. https://doi.org/10.1016/j.patcog.2016.08.003

[16] Khan A., Chefranov A., and Demirel H., “Image Scene Geometry Recognition Using Low-Level Features Fusion at Multi-Layer Deep CNN,” Neurocomputing, vol. 440, pp. 111-126, 2021. https://doi.org/10.1016/j.neucom.2021.01.085

[17] Khodabandelou G., Moon H., Amirat Y., and Mohammed S., “A Fuzzy Convolutional Attention-Based GRU Network for Human Activity Recognition,” Engineering Applications of Artificial Intelligence, vol. 118, pp. 105702, 2022. https://doi.org/10.1016/j.engappai.2022.105702

[18] Koping L., Shirahama K., and Grzegorzek M., “A General Framework for Sensor-Based Human Activity Recognition,” Computers in Biology and Medicine, vol. 95, pp. 248-260, 2018. https://doi.org/10.1016/j.compbiomed.2017.12.025

[19] Liu X., Shi H., Hong X., Chen H., et al., “Hidden States Exploration for 3D Skeleton-Based Gesture Recognition,” in Proceedings of the IEEE Winter Conference on Applications of Computer Vision, Waikoloa, pp. 1846-1855, 2019. https://doi.org/10.1109/WACV.2019.00201

[20] Luo W., Xing J., Milan A., Zhang S., et al., “Multiple Object Tracking: A Literature Review,” Artificial Intelligence, vol. 293, pp. 103448, 2021. https://doi.org/10.1016/j.artint.2020.103448

[21] Luvizon D., Tabia H., and Picard D., “Learning Features Combination for Human Action Recognition from Skeleton Sequences,” Pattern Recognition Letters, vol. 99, pp. 13-20, 2017. https://doi.org/10.1016/j.patrec.2017.02.001

[22] Manzi A., Fiorini L., Limosani R., Dario P., and Cavallo F., “Two-Person Activity Recognition Using Skeleton Data,” IET Computer Vision, vol. 12, no. 1, pp. 27-35, 2018. https://doi.org/10.1049/iet-cvi.2017.0118

[23] Modi N. and Ramakrishna M., “An Investigation of Camera Movements and Capture Techniques on Optical Flow for Real-Time Rendering and Presentation,” Journal of Real-Time Image Processing, vol. 20, no. 60, pp. 1-15, 2023. https://doi.org/10.1007/s11554-023-01322-7

[24] Mukherjee S., Anvitha L., and Lahari M., “Human Activity Recognition in RGB-D Videos by Dynamic Images,” Multimedia Tools and Applications, vol. 79, pp. 19787-19801, 2020. https://doi.org/10.48550/arXiv.1807.02947

[25] Nadeem A., Jalal A., and Kim K., “Accurate Physical Activity Recognition Using Multidimensional Features and Markov Model for Smart Health Fitness,” Symmetry, vol. 12, no. 11, pp. 1-17, 2020. https://doi.org/10.3390/sym12111766

[26] Pang Y., Ke Q., Rahmani H., Bailey J., and Liu J., “IGFormer: Interaction Graph Transformer for Skeleton-Based Human Interaction Recognition,” in Proceedings of the European Conference on Computer Vision, Tel Aviv, pp. 605-622, 2022. https://arxiv.org/abs/2207.12100

[27] Pareek P. and Thakkar A., “A Survey on Video-Based Human Action Recognition: Recent Updates, Datasets, Challenges, and Applications,” Artificial Intelligence Review, vol. 54, pp. 2259-2322, 2021. https://doi.org/10.1007/s10462-020-09904-8

[28] Saleem G., Bajwa U., and Raza R., “Toward Human Activity Recognition: A Survey,” Neural Computing and Applications, vol. 35, pp. 4145-4182, 2023. https://doi.org/10.1007/s00521-022-07937-4

[29] Samir H., Abd El Munim H., and Aly G., “Suspicious Human Activity Recognition Using Statistical Features,” in Proceedings of the 13th International Conference on Computer Engineering and Systems, Cairo, pp. 589-594, 2018. https://doi.org/10.1109/ICCES.2018.8639457

[30] Shi L., Zhang Y., Cheng J., and Lu H., “Two-Stream Adaptive Graph Convolutional Networks for Skeleton-Based Action Recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, pp. 12026-12035, 2019. https://doi.org/10.1109/CVPR.2019.01230

[31] Shuvo M., Ahmed N., Nouduri K., and Palaniappan K., “A Hybrid Approach for Human Activity Recognition with Support Vector Machine and 1D Convolutional Neural Network,” in Proceedings of the IEEE Applied Imagery Pattern Recognition Workshop, Washington, DC, pp. 1-5, 2020. https://doi.org/10.1109/AIPR50011.2020.9425332

[32] Waheed M., Javeed M., and Jalal A., “A Novel Deep Learning Model for Understanding Two-Person Interactions Using Depth Sensors,” in Proceedings of the International Conference on Innovative Computing, Lahore, pp. 1-8, 2021. https://doi.org/10.1109/ICIC53490.2021.9692946

[33] Wang X., Sun Z., Chehri A., Jeon G., et al., “Deep Learning and Multi-Modal Fusion for Real-Time Multi-Object Tracking: Algorithms, Challenges, Datasets, and Comparative Study,” Information Fusion, vol. 105, pp. 102247, 2024. https://doi.org/10.1016/j.inffus.2024.102247

[34] Yadav S., Tiwari K., Pandey H., and Akbar S., “A Review of Multimodal Human Activity Recognition with Special Emphasis on Classification, Applications, Challenges and Future Directions,” Knowledge-Based Systems, vol. 223, pp. 106970, 2021. https://doi.org/10.1016/j.knosys.2021.106970

[35] Yan S., Xiong Y., and Lin D., “Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition,” in Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, pp. 7444-7452, 2018. https://doi.org/10.1609/aaai.v32i1.12328

[36] Yi M., Lee W., and Hwang S., “A Human Activity Recognition Method Based on Lightweight Feature Extraction Combined with Pruned and Quantized CNN for Wearable Device,” IEEE Transactions on Consumer Electronics, vol. 69, pp. 657-670, 2023. https://doi.org/10.1109/TCE.2023.3266506

[37] Zhang L., “Enterprise Employee Work Behavior Recognition Method Based on Faster Region Convolutional Neural Network,” The International Arab Journal of Information Technology, vol. 22, no. 2, pp. 291-302, 2025. https://doi.org/10.34028/iajit/22/2/7

[38] Zhang S., Li Y., Zhang S., Shahabi F., et al., “Deep Learning in Human Activity Recognition with Wearable Sensors: A Review on Advances,” arXiv Preprint, vol. arXiv:2111.00418v5, pp. 1-42, 2021. https://arxiv.org/abs/2111.00418v5

[39] Zheng L., Tang M., Chen Y., Zhu G., et al., “Improving Multiple Object Tracking with Single Object Tracking,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, pp. 2453-2462, 2021. https://doi.org/10.1109/CVPR46437.2021.00248