Fig. 1. Overview of our multi-view video summarization method.

Besides, automatic music summarization has been considered in [28].

Graph models have also been used for video summarization. Lu et al. [10] developed a graph optimization method that computes the optimal video skim in each scene via dynamic programming. Ngo et al. [13] used temporal graph analysis to effectively encapsulate information for video structure and highlights. By modeling the video evolution with a temporal graph, their method can automatically detect scene changes and generate summaries. Lee et al. [29] presented a scenario-based dynamic video abstraction method using graph matching. Multi-level scenarios, generated by graph-based video segmentation and hierarchical segmentation, are used to segment a video into shots. Dynamic video abstractions are accomplished by accessing the hierarchy level by level. Another graph-based video summarization method is given by Peng and Ngo [14], in which highlighted events are detected by a graph clustering algorithm incorporating an effective similarity metric for video clips. Compared with these methods, we focus on multi-view videos. Owing to the content correlations among the multiple views, the spatio-temporal shot graph we construct has more complicated node connections, which makes summarization challenging.

The above methods provide many effective solutions to mono-view video summarization. However, to the best of our knowledge, few methods are dedicated to multi-view video summarization. Multi-view video coding (MVC) algorithms [30]–[32] also deal with multi-view videos. Using techniques such as motion estimation and disparity estimation, MVC removes information redundancy in the spatial and temporal domains. The video content itself, however, is unchanged, so MVC cannot remove redundancy at the semantic level. In contrast, our multi-view video summarization method makes an effort in this direction by exploring the content correlations among multi-view video shots and selecting the most representative shots for the summary.

III. OVERVIEW

We construct a spatio-temporal shot graph to represent the multi-view videos. Multi-view summarization is achieved through event-centered shot clustering via random walks, followed by multi-objective optimization. Spatio-temporal shot graph construction and multi-view summarization are the two key components. An overview of our method is shown in Fig. 1.

To construct the shot graph, we first parse the input multi-view videos into content-consistent video shots. As a result, dynamic shots and important static shots are retained. The preserved shots are used as graph nodes, and the corresponding shot importance values are used as node values. To evaluate the importance, a Gaussian entropy fusion model is developed that fuses together a set of intrinsic video features.
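To make the fusion step concrete, the sketch below shows one plausible reading of a Gaussian entropy fusion: each feature column is Gaussian (z-score) normalized, and features are weighted by the Shannon entropy of their value distribution, so that more informative features contribute more to a shot's importance. The feature set, the squashing function, and the entropy weighting here are illustrative assumptions, not the exact formulation developed later in the paper.

import numpy as np

def gaussian_entropy_importance(features, n_bins=16):
    """Fuse per-shot feature scores into a single importance value per shot.

    features: (n_shots, n_features) array, e.g., motion, brightness, and
    audio-energy scores per shot. This is an illustrative sketch, not the
    paper's exact Gaussian entropy fusion model.
    """
    x = np.asarray(features, dtype=float)
    # Gaussian (z-score) normalization per feature column.
    z = (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)
    scores = 1.0 / (1.0 + np.exp(-z))  # squash each feature to (0, 1)

    # Weight each feature by the entropy of its value distribution:
    # features that vary informatively across shots get larger weights.
    weights = np.empty(x.shape[1])
    for j in range(x.shape[1]):
        hist, _ = np.histogram(x[:, j], bins=n_bins)
        p = hist / max(hist.sum(), 1)
        p = p[p > 0]
        weights[j] = -(p * np.log2(p)).sum()
    weights /= weights.sum() + 1e-8

    return scores @ weights  # one importance value per shot

# Toy example: 5 shots described by 3 intrinsic features.
rng = np.random.default_rng(0)
print(gaussian_entropy_importance(rng.random((5, 3))))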
The multi-view shots usually have diverse correlations with different attributes, such as temporal adjacency and content similarity. We use a hypergraph to systematically characterize the correlations among shots. A hypergraph is a graph in which an edge, usually called a hyperedge, can link any subset of nodes. Each kind of correlation among multi-view shots is thus represented by a corresponding kind of hyperedge in the hypergraph. The hypergraph is further converted into a spatio-temporal shot graph, where the correlations of shots within each view and across views are mapped to edge weights.

To perform multi-view summarization on the spatio-temporal graph, we employ random walks to cluster the event-centered similar shots. Using these clusters as anchor points, the final summarized multi-view shots are generated by a multi-objective optimization model that supports different user requirements as well as multi-level summarization.
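As one concrete instance of random-walk clustering on a weighted shot graph, the sketch below follows the classic random walker formulation (Grady, 2006): given a few anchor shots, every remaining shot is assigned to the anchor that a random walk starting from it is most likely to reach first. Treating this particular formulation, the seed choice, and the toy affinity matrix as the paper's actual clustering step would be an assumption; they are illustrative only.

import numpy as np

def random_walk_clusters(W, seeds):
    """Label every node of a weighted graph with one of the seed nodes.

    W: symmetric (n, n) affinity matrix; seeds: indices of anchor nodes.
    Each unseeded node gets the label of the seed that a random walk
    starting at that node hits first with the highest probability.
    """
    n = W.shape[0]
    L = np.diag(W.sum(axis=1)) - W  # graph Laplacian
    seeded = np.array(seeds)
    free = np.array([i for i in range(n) if i not in set(seeds)])

    # Solve L_UU * X = -L_US for the seed-hitting probabilities X.
    X = np.linalg.solve(L[np.ix_(free, free)], -L[np.ix_(free, seeded)])

    labels = np.empty(n, dtype=int)
    labels[seeded] = np.arange(len(seeded))
    labels[free] = X.argmax(axis=1)
    return labels

# Toy shot graph: two tight groups bridged by weak edges.
W = np.array([[0.0, 0.9, 0.8, 0.1, 0.0],
              [0.9, 0.0, 0.7, 0.0, 0.0],
              [0.8, 0.7, 0.0, 0.1, 0.1],
              [0.1, 0.0, 0.1, 0.0, 0.9],
              [0.0, 0.0, 0.1, 0.9, 0.0]])
print(random_walk_clusters(W, seeds=[0, 4]))  # -> [0 0 0 1 1]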
We use the multi-view video storyboard and the event-board to represent the multi-view summaries. The multi-view storyboard presents the event-centered summarized shots in a multi-view manner, as shown in Fig. 5. In contrast, the event-board, shown in Fig. 6, assembles the summarized shots along the timeline.

IV. SPATIO-TEMPORAL SHOT GRAPH

It is difficult to generate summaries, especially video skims, directly from multi-view videos. A common idea is to first parse the videos into shots. Video summarization is thereby transformed into the problem of selecting a set of representative shots. Obviously, the selected shots should favor interesting events. Meanwhile, these shots should be nontrivial. To achieve this, both the content correlations and the disparities among shots are taken into account. In previous methods for mono-view video summarization, each shot correlates only with its similar shots along the temporal axis. Such correlations are simple and easily modeled. For multi-view videos, however, each shot correlates closely not only with the temporally adjacent shots in its own view but also with the spatially neighboring shots in other views. Relationships among shots increase exponentially relative to the mono-view case, and the correlations are thus far more complicated. To better explore such correlations, we consider them separately according to different attributes, for instance, temporal adjacency, content similarity, and high-level semantic correlation. A hypergraph is first introduced to systematically model the correlations: each graph node denotes a shot resulting from video parsing, while each type of hyperedge characterizes one kind of relationship among shots. We then transform the hypergraph into a weighted spatio-temporal shot graph. The weights on the graph edges thus quantitatively evaluate the correlations among multi-view shots.
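One standard way to turn typed hyperedges into the pairwise edge weights of such a shot graph is clique expansion: each hyperedge becomes a clique over the shots it links, and every pair of shots accumulates the weights of all hyperedges, of any type, that contain both. The sketch below uses this conversion; the shot IDs, the per-type weights, and the choice of clique expansion itself are illustrative assumptions rather than the paper's exact mapping.

import itertools
from collections import defaultdict

# Hypothetical shot IDs of the form "v<view>_s<index>". Each hyperedge links
# the subset of shots sharing one kind of correlation; weights are made up.
hyperedges = {
    "temporal_adjacency": [({"v1_s0", "v1_s1"}, 0.9), ({"v2_s0", "v2_s1"}, 0.9)],
    "content_similarity": [({"v1_s0", "v2_s0", "v3_s1"}, 0.7)],
    "semantic_correlation": [({"v1_s1", "v2_s1"}, 0.5)],
}

def clique_expand(hyperedges):
    """Convert typed hyperedges into a weighted pairwise shot graph."""
    edge_weight = defaultdict(float)
    for edges in hyperedges.values():
        for nodes, w in edges:
            # Every pair inside a hyperedge becomes (or reinforces) an edge.
            for u, v in itertools.combinations(sorted(nodes), 2):
                edge_weight[(u, v)] += w
    return dict(edge_weight)

for (u, v), w in sorted(clique_expand(hyperedges).items()):
    print(f"{u} -- {v}: weight {w:.2f}")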