Stacked Sparse Autoencoders for Human Action Recognition

International Open Access Journal of Modern Engineering Research (IJMER)

Muhammed Sadiq 1, Mariofanna Milanova 2
1,2 (Department of Computer Science, University of Arkansas at Little Rock, USA)
Corresponding Author: Muhammed Sadiq

ABSTRACT: Recognizing human actions in a video stream remains a challenging problem in computer vision. This is due to many factors, such as variations in how the same action is performed by different people, differing viewpoints, insufficient training data, and the location of the person within the video frames. In this paper, we demonstrate that sparse autoencoders trained on video frames can achieve high accuracy in recognizing the type of action performed by a human on popular datasets. We propose a new human action recognition system built from multiple sparse autoencoders. In the first stage, we apply an unsupervised deep learning algorithm; we then perform supervised learning by fine-tuning the autoencoders with the appropriate labels for the different types of human actions. The system is evaluated on three datasets (Weizmann, KTH and UCF Sports). The results show an accuracy of 99% on the Weizmann dataset (10 actions), 86% on the KTH dataset (6 actions) and 96.7% on UCF Sports (8 actions).

Keywords: Autoencoder, background subtraction, deep learning, features, sparsity.

I. INTRODUCTION
Video data are growing at a remarkable rate every minute, according to recent YouTube statistics [1]. Video content is becoming very important to millions of users over the internet, so the need for accurate and automatic systems for human action recognition rises every day. Such systems play a major role in applications such as automatic monitoring of surveillance cameras, medical treatment, behavioral analysis, object recognition, efficient video search and video retrieval [2][3]. These systems are valuable because of their ability to determine the important events or actions that occur in a video stream. Many studies have been proposed in this active research area; the most popular way to implement action recognition is to extract features from video frames and then choose a suitable classifier that assigns the features to the correct action class [2][4][5][6]. While deep learning (DL) is still a relatively new method in image recognition and video analysis compared with other approaches, it can achieve higher accuracy in image recognition, especially with large-scale vision data. It was a turning point for the computer vision community when the SuperVision team won the ImageNet competition in 2012 [7]. Deep learning techniques can achieve different levels of abstraction through many layers, with higher accuracy [8][9]. A considerable amount of deep learning research has addressed object and human action recognition, but achieving high accuracy remains an open challenge for most studies in this field.
Zhize Wu et al. [12] used stacked denoising autoencoders to learn and rebuild depth information from 3D joint features for human action recognition on video frames; feature enhancement was achieved by adding class and temporal constraints to the autoencoders to capture small details. Moez Baccouche et al. [13] proposed a 3D convolutional network to learn features automatically; a recurrent neural network was then trained on these features to detect the class of the video stream in the KTH dataset. A hybrid method is presented in [14], where Mahmudul Hasan et al. combined active learning with deep networks, updating the parameters of a sparse autoencoder to automatically select favorable features and update the existing system. Extending the use of convolutional neural networks from still-image recognition to video frames was proposed in [15] and [16]: Karen Simonyan et al. [15] used video frames as input to train two-stream ConvNets, while Andrej Karpathy et al. [16] suggested using ConvNets for large-scale video classification. However, their results were not satisfying. In this paper, the idea of object recognition using stacked sparse autoencoders is extended to human action recognition in video stream data by proposing a new DL architecture through multiple
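To make the core building block concrete, the following is a minimal NumPy sketch of a single sparse-autoencoder layer of the kind stacked in such architectures. It assumes sigmoid units, tied squared-error reconstruction, weight decay, and the standard KL-divergence sparsity penalty on the mean hidden activation; all class and parameter names here are illustrative, not taken from the paper, and the random input stands in for flattened frame patches.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class SparseAutoencoder:
    """One sparse-autoencoder layer: encode, decode, and one gradient step."""

    def __init__(self, n_in, n_hidden, rho=0.05, beta=3.0, lam=1e-4, seed=0):
        rng = np.random.default_rng(seed)
        r = np.sqrt(6.0 / (n_in + n_hidden))        # common init range for sigmoids
        self.W1 = rng.uniform(-r, r, (n_hidden, n_in))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.uniform(-r, r, (n_in, n_hidden))
        self.b2 = np.zeros(n_in)
        self.rho, self.beta, self.lam = rho, beta, lam  # target sparsity, its weight, decay

    def step(self, X, lr=0.1):
        m = X.shape[0]
        A1 = sigmoid(X @ self.W1.T + self.b1)       # hidden activations
        A2 = sigmoid(A1 @ self.W2.T + self.b2)      # reconstruction of the input
        rho_hat = A1.mean(axis=0)                   # mean activation of each hidden unit
        # Loss = reconstruction error + weight decay + KL sparsity penalty.
        kl = np.sum(self.rho * np.log(self.rho / rho_hat)
                    + (1 - self.rho) * np.log((1 - self.rho) / (1 - rho_hat)))
        loss = (0.5 / m) * np.sum((A2 - X) ** 2) \
               + 0.5 * self.lam * (np.sum(self.W1 ** 2) + np.sum(self.W2 ** 2)) \
               + self.beta * kl
        # Backpropagation, including the sparsity term's gradient at the hidden layer.
        d2 = (A2 - X) * A2 * (1 - A2) / m
        sparse_grad = self.beta * (-(self.rho / rho_hat)
                                   + (1 - self.rho) / (1 - rho_hat)) / m
        d1 = (d2 @ self.W2 + sparse_grad) * A1 * (1 - A1)
        self.W2 -= lr * (d2.T @ A1 + self.lam * self.W2)
        self.b2 -= lr * d2.sum(axis=0)
        self.W1 -= lr * (d1.T @ X + self.lam * self.W1)
        self.b1 -= lr * d1.sum(axis=0)
        return loss

# Toy pretraining run on random data standing in for flattened frame patches.
rng = np.random.default_rng(1)
X = rng.random((64, 20))
ae = SparseAutoencoder(n_in=20, n_hidden=8)
first = ae.step(X)
for _ in range(200):
    last = ae.step(X)
```

In a stacked configuration, the hidden activations `A1` of a trained layer become the input to the next layer's unsupervised pretraining; supervised fine-tuning then backpropagates label error through the whole stack, as the paper's pipeline describes.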

| IJMER | ISSN: 2249–6645 | www.ijmer.com | Vol. 7 | Iss. 8 | August 2017 | 54 |

