International Journal of Research in Advent Technology, Vol.2, No.6, June 2014 E-ISSN: 2321-9637
Automated Human Action Recognition for Effective Surveillance System using 3D Convolutional Neural Network

Sathyashrisharmilha. P
Department of Computer Science and Engineering, Adithya Institute of Technology
Email: shriapr29@gmail.com

Abstract — Recognizing non-rigid objects such as humans and their actions has attracted major attention in recent years. There is a demand for automated surveillance systems to make patrolling more effective, and the major component of a visual surveillance system is human action recognition. This work focuses on developing a system that recognizes human actions automatically. The input fed into the system is realistic surveillance video data; from the 2D videos, 3D features are extracted using a convolutional neural network. The performance of the system depends on the training of the action templates. Whenever a human action matches a template action, the system intimates the officials responsible so that they can give it immediate attention. Both the normal human actions and the suspicious ones must be trained well into the system. This work aims to identify suspicious human actions happening within the premises and to intimate the officials about them for their immediate attention.

Index Terms — Surveillance videos; 3D Convolutional Neural Network; Template matching; Automated Human Action Recognition.

1. INTRODUCTION

Automatic human action recognition has drawn major attention in the arena of video analysis technology due to growing demands from many applications [14], such as surveillance environments, entertainment environments and healthcare systems [4]. In the surveillance environment, the automatic detection of a suspicious action alerts the respective authority to the different human actions that are banned inside the premises [5]; the environment considered here is the educational institution. Similarly, in the other two applications, human actions and activities are recognized automatically to make the surveillance much more effective.

There have been numerous research efforts reported for various applications based on human action recognition [8, 13], more specifically abnormal or suspicious actions, human gestures, human interaction, pedestrian traffic and simple actions. An action differs from an activity: an action is a simple pattern or an elementary part of a motion, whereas a series of actions is considered an activity. This work is fully concerned with actions.

Recognizing non-rigid objects such as humans and their actions from surveillance videos still proves to be a significant challenge. The accuracy in recognizing the actions depends on the resolution of the videos taken from the surveillance cameras, due to occlusions, changes in viewpoint across cameras and cluttered backgrounds.

The work is initiated by segmenting the objects; it then proceeds by extracting the features from the segmented parts. Once the features are extracted, the algorithms are applied according to the need. These procedures are considered the basic steps needed to start the recognition process. The recognition of human actions falls into three different scenarios: recognizing a single person's actions, multi-person actions and suspicious actions. The basic idea of the recognition system is given in Fig. 1.

Fig. 1. An overview of a general recognition system for human actions

The main purpose of this model is to recognize automatically the different or peculiar actions done
by the humans which are actually not allowed within the college premises. Through effective training, the system is made to recognize human actions by itself, which keeps the surveillance effective and up to date. The basic idea revolves around the system segmenting the surveillance video data into finite frames in the same format as they are captured, and allowing the system to be trained on both normal and different (suspicious) human actions. After training, testing is done. Meanwhile, features are extracted using 3D convolution and trained using a 3D convolutional neural network [3]. When a template matches a suspicious action in the video, the system produces an intimation to the concerned authorities for their immediate attention.

The rest of the paper is structured as follows. A survey of the reference work, along with methods, merits and demerits, is given in Section 2. Related work regarding human action recognition is given in Section 3. The proposed system for automated human action recognition is given in Section 4. Section 5 demonstrates the experimental setup and results. Section 6 describes the performance analysis. The conclusion and future work are given in Section 7.

2. SURVEY OF LITERATURE

[Yang, Ming et al. (2009)] proposed a method called Motion Edge History Images (MEHI) for detecting human actions by boosting efficient motion features. MEHI is a resourceful approach for detecting basic human activities or actions such as making phone calls, pointing the hand at an object and putting an object down. Additionally, MEHI lets the system learn very fast using tree-structured boosting classifiers, which encode both the shape and motion patterns of actions. It is robust and tends to diminish the search space. The method is well suited to recognizing the 16 studied actions.
It achieves highly competitive performance compared with state-of-the-art algorithms, and it also tries to lessen the restriction of slow camera motion.

[Mobahi, Hossein et al. (2009)] proposed a learning method for deep architectures that takes advantage of sequential data, specifically the temporal coherence that exists between unlabeled video recordings. Temporal coherence in videos acts as a pseudo-supervisory signal: labeled and unlabeled data are trained concurrently, with the temporally coherent unlabeled data acting as a regularizer. It produces better results when the unlabeled data come from the same dataset, and the data are more beneficial when the objects are more similar to each other.

[Wang and Mori (2011)] offered a discriminative part-based approach to Human Action Recognition (HAR) in video sequences using motion features. The method is centered on the Hidden Conditional Random Field (HCRF) used for object recognition [12]. Like HCRF, it models a human action as a flexible collection of parts conditioned on image observations, combining global and local patch features; the combination proves more effective than HCRF alone. A substitute model for learning the parameters of the HCRF is the max-margin framework (MMHCRF), which can handle the complex hidden structures arising in various computer-vision problems. Max-margin learning can deal with a large spectrum of complex structures and produces strong results even under partial occlusion and viewpoint changes, given robust features.

[Gowsikhaa et al. (2012)] suggested a method for video analytics: processing gathered video data and analyzing it to obtain domain-specific information. The method takes realistic live surveillance videos for detecting activities. A Gabor filter is used for preprocessing the videos, and foreground and background estimation is done by an Artificial Neural Network (ANN), which also automates face recognition. The work is best suited to the examination-hall scenario, to avoid malpractice in the absence of a supervisor in the hall: face recognition, hand recognition and contact between face and hand are processed. It decreases the error rate caused by manual monitoring and is very effective for real-time security applications, producing appropriate results with little computation time even under different lighting and illumination.

[Charara, Nour et al. (2012)] proposed a global method based on depth-zone segmentation, meant for the automatic detection of abnormal human behavior in video surveillance.

In general, Intelligent Video Surveillance (IVS) has drawn attention for filtering out large amounts of useless information. This work focuses mainly on event recognition in crowded scenes to detect abnormal behavior, saving a lot of human resources, providing effective security, reducing the lags in monitoring systems and intimating the officials through real-time alarms.

The descriptions of and comparisons among the various reference works in Table 1 give an idea of which algorithms or methods to use for which functions. Many methods have been introduced for different purposes; by knowing the purpose of each method before starting the
work, one can easily apply the methods to the purposes suited to the implementation.

Table 1. Comparison of methods used for recognizing human actions

1. MEHI (Motion Edge History Images)
   Merits: Achieves competitive performance in comparison with state-of-the-art algorithms.
   Limitations: MEHI is inherently insensitive to appearance variations and clutter.
   Applications: Detecting human actions from monocular videos, such as video surveillance and intelligent video content analysis.

2. Deep learning from temporal coherence (a deep learning algorithm for visual object recognition)
   Merits: Beneficial for many supervised tasks; huge collections of data can be obtained without human annotation.
   Limitations: Unlabeled data remain beneficial when the two sets come from different datasets, but if the objects are different it becomes quite difficult.
   Applications: Also useful for non-visual tasks where the sequence information has structure, e.g. speaker verification.

3. HCRF (Hidden Conditional Random Field) and MMHCRF (Max-Margin HCRF)
   Merits: Quite effective in recognizing actions while combining large-scale features and local patch features.
   Limitations: One has to specify the parts manually.
   Applications: Computer vision and many other fields.

4. Gabor filter with background and foreground estimation
   Merits: Provides automatic invigilation.
   Limitations: Converts RGB images to grayscale, so it is not directly applicable to color images.
   Applications: Examination halls.

5. Global method based on depth-zone segmentation
   Merits: High efficiency in security protection; saves a lot of human and material resources by reducing lags.
   Limitations: Crowded scenes, hard lighting conditions.
   Applications: Popular in security applications – public places, military applications; abnormal-behavior detection based on heterogeneous sensor networks.

3. RELATED WORK

In the surveillance environment, a large number of human beings are captured and recorded by the surveillance cameras. Since each individual is unique, each person's actions differ from every other person's [1]. Most of the time, human actions under a watchful environment are normal, but there are chances of susceptible or peculiar actions by humans within the campus premises. Detecting and recognizing those actions is difficult and depends on the situation they belong to. For instance, the following human actions are considered suspicious within a college premises: usage of cell phones by students (which is banned), standing on the side walls, trying to jump from heights, usage of weapons inside the campus, and a person needing medical assistance.

Recognition can be achieved only step by step. After the input is fed into the system, feature extraction takes place. The features are extracted using the Gabor filter given in Eq. (1):

ψ_i(m) = (g_i²/σ²) · exp(−g_i² m² / (2σ²)) · [ exp(j g_i m) − exp(−σ²/2) ]    (1)

where g_i and σ are the general arguments of the Gabor function.

On the extracted features, the 3D CNN is applied to train the system with the template samples [11]. While training, the system gradually learns; to obtain results as accurate as expected, the weights of the appropriate nodes are adjusted. Errors occurring during training are checked using online backpropagation [6]. The Gauss–Newton approximation to the Hessian matrix, extensively used in regression statistics, is given in Eq. (2). It is relevant only for small numbers of parameters, and it does not require a line search, so it can also be used
in stochastic mode, even though that has not been tested.
Δω = [ Σ_p (∂N(ω, χ_p)/∂ω)′ (∂N(ω, χ_p)/∂ω) ]^(−1) ∇E(ω)    (2)

The Levenberg–Marquardt algorithm is used to keep the update well behaved when some eigenvalues are small, as shown in Eq. (3):

Δω = [ Σ_p (∂N(ω, χ_p)/∂ω)′ (∂N(ω, χ_p)/∂ω) + μI ]^(−1) ∇E(ω)    (3)

The subsequent idea is the stochastic diagonal Levenberg–Marquardt method: backpropagation of the diagonal Hessian keeps a running estimate of the second derivative of the error with respect to each parameter. Substituting these estimates into the Levenberg–Marquardt formula scales each parameter's learning rate, as given in Eq. (4):

η_ki = ε / ( ∂²E/∂ω_ki² + μ )    (4)

where ε is the global learning rate, ∂²E/∂ω_ki² is the running estimate of the diagonal second derivative with respect to weight ω_ki, and μ is the Levenberg–Marquardt parameter. The background idea to keep in mind is to reduce the number of trainable parameters. The parameters are randomly initialized and trained using the above methods. Pursuing the information from the related works, a new idea arises; it is discussed in the forthcoming section.

4. PROPOSED WORK

The major problem encountered in most works is differentiating suspicious human actions from normal actions. Suspicious actions are also human actions, so it becomes a challenge for the system to recognize them automatically. This problem can be resolved by using templates; matching against the templates is the final procedure of system learning. The work starts with the input surveillance video data. These videos are fragmented into frames, and from each frame the features are extracted using the Gabor filter. The extracted features are in 3D object format. The 3D Convolutional Neural Network (CNN) is applied on the features to train the system. Once the system is well trained, it is ready to test the video samples: the actions in each frame tend to match the templates given as training samples. When a match is found between a captured action and a template, an intimation is sent to the concerned authority to draw their immediate attention to the action. The intimation is delivered by e-mail and by phone message (SMS). The proposed automated human action recognition system is given in Fig. 2.
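Since the extracted features are 3D (two spatial dimensions plus time), the network's kernels are 3D as well. The following numpy sketch illustrates one "valid"-mode 3D convolution over a stack of adjacent frames; the frame count and kernel size used here are illustrative assumptions, not the architecture of this work.

```python
import numpy as np

def conv3d_valid(volume, kernel):
    """Naive 'valid'-mode 3D convolution: the kernel slides over
    (time, height, width), coupling several adjacent frames at once."""
    T, H, W = volume.shape
    t, h, w = kernel.shape
    out = np.zeros((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                out[i, j, k] = np.sum(volume[i:i + t, j:j + h, k:k + w] * kernel)
    return out

frames = np.random.rand(7, 60, 40)   # 7 stacked grayscale frames (toy sizes)
kernel = np.random.rand(3, 7, 7)     # one 3x7x7 spatio-temporal kernel
feature_map = conv3d_valid(frames, kernel)  # shape (5, 54, 34)
```

A sub-sampling (pooling) layer would then shrink the spatial dimensions of `feature_map`, and the convolution/sub-sampling pair repeats before the fully connected layers.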
Fig. 2. Proposed system of automated human action recognition

As Fig. 2 shows, the 3D CNN used to train the system comprises several layers: a hardwired kernel layer, convolutional layers, sub-sampling layers and, last, fully connected layers. The convolution and sub-sampling layers are applied alternately. After training, the stipulated actions go to the recognized action classes if they match each other. Only the suspicious actions are intimated to the officers through SMS and e-mail notifications. After developing this proposed model, several experiments are made for both training and testing purposes. The forthcoming experimental setup describes the idea in detail.

5. EXPERIMENTAL SETUP AND RESULT ANALYSIS

The proposed realistic automated human action recognition model was designed and experimented with using specific hardware and software. The experiments were conducted on an Intel Core i5 CPU M480 @ 2.67 GHz processor with 4 GB of RAM. The simulation part of the work was carried out in MATLAB (R2013b) on a Microsoft Windows 7 Ultimate 32-bit operating system environment.
As the video samples are taken from a realistic college environment, the whole video is filled with the many people who visit the campus. Since the recorded video is full of humans, a human detector is needed to locate them. Once the humans are detected in an image frame, a bounding box is marked for each individual; the bounding box tracks the movement of each person. For the recognition made by the system to be judged, the recognition rate is calculated: the ratio of the number of actions recognized correctly, N_R, to the total number of samples, N_S, as given in Eq. (5):

Recognition Rate = (N_R / N_S) × 100    (5)

The work starts with the individual adjacent frames of the video sample. The input image is converted into a gray-scale image. Next, the horizontal and vertical images for both the gradient and the optical flow are collected for the input frame. The humans are detected using the detector and surrounded by bounding boxes. Once a captured action matches a trained template, the numbers of the frames in which the action takes place are intimated through short message service and also by e-mail to the authority.
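The matching step described above can be sketched as nearest-template classification over per-frame feature vectors. Everything in this sketch is an illustrative stand-in: the gradient-magnitude "features", the template vectors and the action labels are assumptions for demonstration, not the trained 3D CNN features of this work.

```python
import numpy as np

def frame_features(gray_frame):
    """Toy feature vector: mean horizontal/vertical gradient magnitudes,
    standing in for the learned spatio-temporal features."""
    gy, gx = np.gradient(gray_frame.astype(float))  # per-axis gradients
    return np.array([np.abs(gx).mean(), np.abs(gy).mean()])

def match_template(feature, templates):
    """Return the template label whose feature vector is closest (Euclidean)."""
    label = min(templates, key=lambda k: np.linalg.norm(feature - templates[k]))
    return label, float(np.linalg.norm(feature - templates[label]))

templates = {                        # hypothetical action templates
    "phone_use": np.array([0.8, 0.1]),
    "fallen_person": np.array([0.1, 0.9]),
}
frame = np.tile(np.arange(32.0), (32, 1))   # toy frame: gradient only along x
label, dist = match_template(frame_features(frame), templates)
```

In the full system an alert (SMS/e-mail with the frame numbers) would fire only when the matched label belongs to the suspicious classes and the distance is below a chosen threshold.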
(a) Usage of cell phones inside the campus, which is banned for students
(b) A person fallen on the floor who needs medical help
(c) Standing in groups for a long time
(d) Serious fighting between people within the campus

Fig. 3. Templates to train the system for suspicious actions

The actions set as templates are the training samples for the system. Template actions must be chosen very carefully, as they are the matching source for detecting and recognizing the suspicious actions within the campus. The sample template actions are shown in Fig. 3. Fig. 3(a) indicates the action of a phone held to the ear, which is banned for students within working hours. Fig. 3(b) shows a person who needs immediate medical assistance. Fig. 3(c) shows a group chatting for hours inside the laboratory, wasting valuable time. Fig. 3(d) shows a dispute with people fighting each other, which is strictly not allowed inside the campus. All these actions are considered suspicious, as they do not abide by the rules of the institution.

Table 2. Error rate of CNN and ANN while training

Epoch No.   CNN (%)   ANN (%)
0           50        53
2           20        24
4           16        22
6           12        17
8           8         18
10          8         14
12          8         16
14          8         15
16          6         10
18          7         11
20          7         12
22          7         11
24          7         11
26          7         11

Any training method produces a high error rate at the initial stages, as the system is learning everything for the first time; this happens in both the CNN and the ANN, which show high error rates at epoch zero. Gradually, as the system learns the action classes, the error rate reduces accordingly. This is clearly shown in Table 2, and through the analysis the CNN outperforms the ANN.

6. PERFORMANCE ANALYSIS
The performance of the automated human action recognition system is discussed here. The performance depends mainly on the humans and their actions. As the video data consist only of humans, the work is easier to proceed with, since there is no need to distinguish humans from animals. The detection of humans, the contact between them and the actions happening between them are handled, and the system works well in the multi-person scenario. All these scenarios are taken into consideration for analyzing the system. As per Table 2, the CNN shows the minimum error rate compared with the other method, the ANN. This is made much easier to understand by the comparison chart shown below in Fig. 4.
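The comparison can also be checked numerically. The snippet below types in the tabulated error rates and relates them to the recognition rate of Eq. (5); an error rate of e% corresponds to a recognition rate of (100 − e)%. The numbers are exactly those of Table 2.

```python
# Error rates from Table 2 (percent), at epochs 0, 2, ..., 26.
cnn = [50, 20, 16, 12, 8, 8, 8, 8, 6, 7, 7, 7, 7, 7]
ann = [53, 24, 22, 17, 18, 14, 16, 15, 10, 11, 12, 11, 11, 11]

def recognition_rate(n_recognized, n_samples):
    """Eq. (5): percentage of samples whose action is recognized correctly."""
    return 100.0 * n_recognized / n_samples

mean_cnn = sum(cnn) / len(cnn)       # average error over all listed epochs
mean_ann = sum(ann) / len(ann)
best_cnn, best_ann = min(cnn), min(ann)
```

On these figures the CNN averages about 12.2% error against 17.5% for the ANN, and reaches its minimum of 6% at epoch 16, consistent with the claim that the CNN outperforms the ANN.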
Fig. 4. Variations in the error rate of CNN and ANN over different epochs (percentage error vs. number of epochs)

This evidence shows that the CNN yields better results in training the system than other methods such as the ANN. The various human actions stated in Fig. 3 also yield the same performance as in this chart.

7. CONCLUSION AND DISCUSSIONS

This work considered the neural network as the core idea in developing the system, as the neural network is a strong learning method for training a system. Though it takes time to learn, it can recognize actions quickly once well trained. Both normal and suspicious human actions are fed into the system for training; even the normal actions need to be identified by the system automatically. It is then easier for the system to differentiate actions from the normal ones and proceed with the intimation process if an action is susceptible. The 3D CNN, together with backpropagation, trains the system well. This work was developed to make surveillance much more effective and to give priority to people's safety. Only the surveillance of educational institutions is considered here, but the work can be extended to all other types of surveillance, such as hospitals, railway stations and other crowded areas.

REFERENCES

[1] Burghouts, G. J.; Schutte, K. (2013): Spatio-temporal layout of human actions for improved bag-of-words action detection. Pattern Recognition Letters 34(15), pp. 1861-1869.
[2] Charara, Nour; Jarkass, Iman; Sokhn, Maria; Mugellini, Elena; Khaled, O. A. (2012): ADABeV: Automatic detection of abnormal behavior in video-surveillance. World Academy of Science, Engineering and Technology, 68, pp. 172-178.
[3] Choo, J. Tan; Lim, P. Chee; Cheah, Yu-N. (2014): A multi-objective evolutionary algorithm-based ensemble optimizer for feature selection and classification with neural
network models. Neurocomputing 125, pp. 217-228.
[4] Gonzalez, R. C.; Woods, R. E. (2008): Digital Image Processing, 3rd edn. Pearson Education, Inc., publishing as Prentice Hall.
[5] Gowsikhaa, D.; Manjunath; Abirami, S. (2012): Suspicious human activity detection from surveillance videos. International Journal on Internet and Distributed Computing Systems, 2(2), pp. 141-148.
[6] Han, Jiawei; Kamber, Micheline; Pei, Jian (2006): Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers Inc., San Francisco.
[7] Mobahi, Hossein; Collobert, Ronan; Weston, Jason (2009): Deep learning from temporal coherence in video. In Proceedings of the 26th Annual International Conference on Machine Learning, ACM, pp. 737-744.
[8] Shabani, A. H.; Zelek, J. S.; Clausi, D. A. (2013): Multiple scale-specific representations for improved human action recognition. Pattern Recognition Letters 34(15), pp. 1771-1779.
[9] Wang, Yang; Mori, Greg (2011): Hidden part models for human action recognition: probabilistic versus max margin. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(7), pp. 1310-1323.
[10] Yang, Ming; Lv, Fengjun; Xu, Wei; Yu, Kai; Gong, Yihong (2009): Human action detection by boosting efficient motion features. In Computer Vision Workshops (ICCV Workshops), 2009 IEEE 12th International Conference on, pp. 522-529.
[11] Zia, M. Zeeshan; Stark, Michael; Schiele, Bernt; Schindler, Konrad (2013): Detailed 3D representations for object recognition and modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(11), pp. 2608-2623.
[12] Zhang, Kaihua; Zhang, Lei; Yang, M. (2013): Real-time object tracking via online discriminative feature selection. IEEE Transactions on Image Processing, 22(12), pp. 4664-4677.
[13] Zhao, Dongjie; Shao, Ling; Zhen, Xiantong; Liu, Yan (2013): Combining appearance and structural features for human action recognition. Neurocomputing 113, pp. 88-96.
[14] Zhen, Xiantong; Shao, Ling; Tao, Dacheng; Li, Xuelong (2012): Embedding motion and structure features for action recognition. pp. 1182-1190.