Group Sparse-Based Mid-Level Representation for Action Recognition
Abstract: Mid-level parts have been shown to be effective for human action recognition in videos. Typically, these semantic parts are first mined with heuristic rules, and then videos are represented via the volumetric max-pooling (VMP) method. However, these methods have two issues: 1) the VMP strategy divides videos by static grids, so a semantic part may occur at different locations in different videos; that is, the VMP strategy loses space-time invariance. To solve this problem, we propose a saliency-driven max-pooling scheme to represent a video: we extract the video's semantic cues with a saliency map and dynamically pool the local maximum responses. This scheme can be considered a semantic content-based feature alignment method. 2) The parts discovered by heuristic rules may be intuitive but not discriminative enough for action classification because they neglect the relations between the detectors. To address this issue, we propose a sparse classifier model to select discriminative parts. Moreover, to further improve the discriminative ability of the representation, we conduct feature selection based on the magnitude of the corresponding entries of the model coefficients. We conduct experiments on four challenging datasets: KTH, Olympic Sports, UCF50, and HMDB51. The results show that the proposed method significantly outperforms state-of-the-art methods.
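To make the pooling idea concrete, the following is a minimal sketch (not the authors' implementation) of saliency-driven max-pooling, assuming NumPy arrays for the per-frame part-detector responses and the saliency map; the function name, array shapes, and threshold are illustrative assumptions. Instead of pooling over fixed spatio-temporal grid cells as in VMP, responses are pooled only over the salient region of the video.

```python
# Minimal sketch of saliency-driven max-pooling (illustrative; assumed
# names and shapes, not the paper's actual code).
import numpy as np

def saliency_driven_max_pool(responses, saliency, threshold=0.5):
    """Pool the maximum detector responses inside the salient region.

    responses: (T, H, W, D) array of D part-detector responses per frame.
    saliency:  (T, H, W) saliency map with values in [0, 1].
    Returns a D-dimensional video-level descriptor.
    """
    mask = saliency >= threshold  # semantic (salient) space-time voxels
    if not mask.any():
        # Fall back to global max-pooling if nothing is marked salient.
        return responses.reshape(-1, responses.shape[-1]).max(axis=0)
    # Gather responses at salient voxels, then take the per-detector maximum.
    salient_responses = responses[mask]  # shape: (num_salient_voxels, D)
    return salient_responses.max(axis=0)
```

The resulting descriptor could then be fed to a sparse (e.g., l1-regularized) linear classifier, and the magnitudes of its coefficients used to select the most discriminative parts, in the spirit of the selection step described in the abstract.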