A Comparative Study on Deep Learning and Machine Learning Models for Human Action Recognition in Aerial Videos
Unmanned Aerial Vehicles (UAVs) find significant application in video surveillance owing to their low cost, high portability, and fast mobility. The approach proposed in this paper recognizes human activity in aerial video sequences from keypoints detected on the human body by OpenPose. The detected keypoints are passed to machine learning and deep learning classifiers to classify the human actions. Experimental results demonstrate that the multilayer perceptron (MLP) and the support vector machine (SVM) outperformed all other classifiers, reporting accuracies of 87.80% and 87.77%, respectively, whereas the LSTM-based models fared worse: stacked Long Short-Term Memory (LSTM) networks produced an accuracy of 71.30% and the bidirectional LSTM yielded 76.04%. The results thus indicate that the machine learning models performed better than the deep learning models, chiefly because of the limited amount of available data; deep learning models are data-hungry and require large training sets to perform well. The paper also analyses the failure cases of OpenPose by testing the system on aerial videos captured by a drone flying at a higher altitude. This work provides a baseline for validating machine learning and deep learning classifiers for human action recognition in aerial videos.
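To make the described pipeline concrete, the following is a minimal sketch of the keypoint-classification stage, assuming scikit-learn, BODY_25-style OpenPose output (25 keypoints with (x, y) coordinates flattened into a 50-dimensional per-frame feature vector), and synthetic placeholder data in place of real pose features; the actual feature construction and classifier hyperparameters are not specified in the abstract.

```python
# Sketch of the keypoint-classification pipeline: OpenPose keypoints
# (assumed already extracted and flattened per frame) fed to the two
# best-performing classifiers reported, an MLP and an SVM.
# Random arrays stand in for real pose features and action labels.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n_samples, n_keypoints = 1000, 25            # 25 body keypoints (BODY_25)
X = rng.normal(size=(n_samples, n_keypoints * 2))  # placeholder (x, y) features
y = rng.integers(0, 5, size=n_samples)             # placeholder action labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Hypothetical hyperparameters; the paper does not report the exact settings.
mlp = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)
svm = SVC(kernel="rbf", C=1.0)

for name, clf in [("MLP", mlp), ("SVM", svm)]:
    clf.fit(X_train, y_train)
    print(f"{name} accuracy: {accuracy_score(y_test, clf.predict(X_test)):.4f}")
```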