Multimodal Video-to-Near-Scene Annotation
Abstract: Traditional video annotation approaches focus on annotating keyframes/shots or whole videos with semantic keywords. However, the extraction of keyframes/shots may lack semantic meaning, and it is hard to describe the content of a long video with multiple topics using only a few keywords. In this work, near-scenes, which contain similar concepts, topics, or semantic meanings, are designed for better video content understanding and annotation. We propose a novel framework of hierarchical video-to-near-scene annotation that not only preserves but also purifies the semantic meanings of near-scenes. To detect near-scenes, a pattern-based prefix tree is first constructed for fast retrieval of near-duplicate videos. Then, videos containing similar near-duplicate segments and similar keywords are clustered with consideration of multimodal features, including visual and textual features. To enhance the precision of near-scene detection, a pattern-to-intensity-mark (PIM) method is proposed to perform precise frame-level near-duplicate segment alignment. For each near-scene, a video-to-concept distribution model is designed to analyze the representativeness of keywords and the discrimination of clusters by the proposed potential term frequency, inverse document frequency, and entropy. Tags are ranked according to video-to-concept distribution scores, and the tags with the highest scores are propagated to the detected near-scenes. Extensive experiments demonstrate that the proposed PIM outperforms state-of-the-art approaches.
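The abstract does not give the exact formulation of the potential term frequency, inverse document frequency, and entropy measures, so the following is a minimal, hypothetical Python sketch of the general idea behind the tag-ranking step: score each candidate tag in a cluster by a TF-IDF-style weight combined with an entropy term, then propagate the top-ranked tags to the near-scene. The function names (tag_scores, top_tags) and the exact weighting formula are assumptions for illustration, not the paper's definitions.

```python
from collections import Counter
from math import log

def tag_scores(cluster_docs, all_docs):
    """Illustrative TF-IDF-with-entropy scoring of candidate tags.

    cluster_docs: list of tag lists for the videos in one near-scene cluster.
    all_docs:     list of tag lists for all videos in the collection.
    NOTE: the formula below is a generic stand-in, not the paper's exact
    "potential term frequency" model.
    """
    n_docs = len(all_docs)
    # Term frequency within the cluster.
    tf = Counter(tag for doc in cluster_docs for tag in doc)
    total = sum(tf.values())
    scores = {}
    for tag, count in tf.items():
        # Inverse document frequency over the whole collection.
        df = sum(1 for doc in all_docs if tag in doc)
        idf = log((1 + n_docs) / (1 + df))
        # Entropy of the tag's distribution across videos in the cluster:
        # tags spread evenly over the cluster are treated as more representative.
        per_video = [doc.count(tag) for doc in cluster_docs]
        s = sum(per_video)
        probs = [c / s for c in per_video if c > 0]
        entropy = -sum(p * log(p) for p in probs)
        scores[tag] = (count / total) * idf * (1 + entropy)
    return scores

def top_tags(cluster_docs, all_docs, k=5):
    """Rank tags by score and keep the top k for propagation to a near-scene."""
    ranked = sorted(tag_scores(cluster_docs, all_docs).items(),
                    key=lambda kv: kv[1], reverse=True)
    return [tag for tag, _ in ranked[:k]]
```

In this sketch, the entropy factor rewards tags that appear consistently across the videos of a cluster rather than in a single video, which loosely mirrors the abstract's goal of selecting representative, discriminative keywords before propagating them to the detected near-scenes.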