IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 12, NO. 7, NOVEMBER 2010

Multi-View Video Summarization

Yanwei Fu, Yanwen Guo, Yanshu Zhu, Feng Liu, Chuanming Song, and Zhi-Hua Zhou, Senior Member, IEEE

Abstract—Previous video summarization studies focused on monocular videos, and the results would not be good if they were applied to multi-view videos directly, due to problems such as the redundancy in multiple views. In this paper, we present a method for summarizing multi-view videos. We construct a spatio-temporal shot graph and formulate the summarization problem as a graph labeling task. The spatio-temporal shot graph is derived from a hypergraph, which encodes the correlations with different attributes among multi-view video shots in hyperedges. We then partition the shot graph and identify clusters of event-centered shots with similar contents via random walks. The summarization result is generated through solving a multi-objective optimization problem based on shot importance evaluated using a Gaussian entropy fusion scheme. Different summarization objectives, such as minimum summary length and maximum information coverage, can be accomplished in the framework. Moreover, multi-level summarization can be achieved easily by configuring the optimization parameters. We also propose the multi-view storyboard and event board for presenting multi-view summaries. The storyboard naturally reflects correlations among multi-view summarized shots that describe the same important event. The event board serially assembles event-centered multi-view shots in temporal order. A single video summary, which facilitates quick browsing of the summarized multi-view video, can be easily generated based on the event board representation.

Index Terms—Multi-objective optimization, multi-view video, random walks, spatio-temporal graph, video summarization.

Manuscript received August 28, 2009; revised December 21, 2009 and March 26, 2010; accepted May 18, 2010. Date of publication June 07, 2010; date of current version October 15, 2010. This work was supported in part by the National Science Foundation of China under Grants 60703084, 60723003, and 60721002, the National Fundamental Research Program of China (2010CB327903), the Jiangsu Science Foundation (BK2009081), and the Jiangsu 333 High-Level Talent Cultivation Program. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Zhu Liu.

Y. Fu, Y. Zhu, C. Song, and Z.-H. Zhou are with the National Key Lab for Novel Software Technology, Nanjing University, Nanjing 210093, China (e-mail: ztwztq2006@gmail.com; yszhu@cs.hku.hk; chmsong@graphics.nju.edu.cn; zhouzh@nju.edu.cn). Y. Guo is with the National Key Lab for Novel Software Technology, Nanjing University, Nanjing 210093, China, and also with the Jiangyin Information Technology Research Institute of Nanjing University (e-mail: ywguo@nju.edu.cn). F. Liu is with the Department of Computer Sciences, University of Wisconsin-Madison, Madison, WI 53562 USA (e-mail: fliu@cs.wisc.edu). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TMM.2010.2052025

I. INTRODUCTION

WITH the rapid development of computation, communication, and storage infrastructures, multi-view video systems that simultaneously capture a group of videos and record the occurrence of events with considerably overlapping fields of view (FOVs) across multiple cameras have become more and more popular. In contrast to the rapid development of video collection and storage techniques, consuming these multi-view videos still remains a problem. For instance, watching a large number of videos to grasp important information quickly is a big challenge.

Video summarization, as an important video content service, produces a condensed and succinct representation of video content, which facilitates the browsing, retrieval, and storage of the original videos. There is a rich literature on summarizing a long video into a concise representation, such as a key-frame sequence [1]–[6] or a video skim [7]–[20]. These existing methods provide effective solutions to summarization. However, they focus on monocular videos. Multi-view video summarization has rarely been addressed, though multi-view videos are widely used in surveillance systems equipped in offices, banks, factories, and crossroads of cities for private and public security. For all-weather, day-and-night multi-view surveillance systems, the amount of recorded video data increases dramatically every day. In addition to surveillance, multi-view videos are also popular in sports broadcasting. For example, in a soccer match, the broadcasters usually replay the goals recorded by different cameras distributed around the stadium. Multi-view video summarization refers to the problem of summarizing multi-view videos into informative video summaries, usually presented as dynamic video shots coming from multiple views, by considering content correlations within each view and among multiple views. The multi-view summaries provide salient events with richer information than less salient ones. This allows the user to grasp the important information from multiple perspectives of the multi-view videos without watching them in their entirety. Multi-view summarization will also benefit the storage, analysis, and management of multi-view video content.

Applying existing monocular video summarization methods to each component of a multi-view video group could lead to a redundant summarization result, as each component has overlapping information with the others. To generate a concise multi-view video summary, information correlations as well as discrepancies among multi-view videos should be taken into account. It is also inadvisable to directly apply previous methods to the video sequence formed by simply concatenating the multi-view videos. Furthermore, since multi-view videos often suffer from different lighting conditions in distinct views, it is nontrivial to evaluate the importance of shots in each view's video and to merge each component into an integral video summary in a robust way, especially when the multi-view videos are captured nonsynchronously. It is thus important to have effective multi-view summarization techniques.

In this paper, we present a method for the summarization of multi-view videos. We first parse the video from each view into shots. Content correlations among multi-view shots are important for producing an informative and compact summary. We use a hypergraph to model such correlations, in which each kind of hyperedge characterizes a kind of correlation among shots. By converting the hypergraph into a spatio-temporal shot graph,
the edge weights can qualitatively measure similarities among shots. We associate the value of a graph node with shot importance computed by a Gaussian entropy fusion scheme. Such a scheme can calculate the importance of shots in the presence of brightness differences and conspicuous noise, by emphasizing useful information and precluding redundancy among video features. With the graph representation, the final summary is generated through event clustering based on random walks and a multi-objective optimization process.

To the best of our knowledge, this paper presents the first multi-view video summarization method. It has the following features.

• A spatio-temporal shot graph is used for the representation of multi-view videos. Such a representation makes the multi-view summarization problem tractable in the light of graph theory. The shot graph is derived from a hypergraph which embeds different correlations among video shots within each view as well as across multiple views.

• Random walks are used to cluster the event-centered shots, and the final summary is generated by multi-objective optimization. The multi-objective optimization can be flexibly configured to meet different summarization requirements. Additionally, multi-level summaries can be achieved easily through setting different parameters. In contrast, most previous methods can only summarize videos from one specific perspective.

• The multi-view video storyboard and the event board are presented for representing multi-view video summaries. The storyboard naturally reflects correlations among multi-view summarized shots that describe the same important event. The event board serially assembles event-centered multi-view shots in temporal order. With the event board, a single video summary that facilitates quick browsing of the summarized video can be easily generated.

The rest of this paper is organized as follows. We briefly review previous work in Section II. In Section III, we present a high-level overview of our method. The two key components of our method, spatio-temporal shot graph construction and multi-view summarization, are presented in Sections IV and V, respectively. We evaluate our method in Section VI and conclude the paper in Section VII.

II. RELATED WORK

This paper is made possible by many inspirations from previous work on video summarization. A comprehensive review of state-of-the-art video summarization methods can be found in [21]. In general, two basic forms of video summaries exist, i.e., static key frames and dynamic video skims. The former consists of a collection of salient images fetched from the original video sequence, while the latter is composed of the most representative video segments extracted from the video source.

Key frame extraction should take into account the underlying dynamics of video content. DeMenthon et al. [1] regarded a video sequence as a curve in a high-dimensional space. The curve is recursively simplified into a tree structure representation. The frames corresponding to junctions between curve segments at different tree levels are viewed as key frames. Hanjalic et al. [3] divided a video sequence into clusters and selected optimal ones using an unsupervised procedure for cluster-validity analysis. The centroids of the clusters are chosen as key frames. Li et al. [4] formulated key frame extraction as a rate-distortion min-max optimization problem, whose optimal solution is found by dynamic programming. Besides, Orriols et al. [5] addressed summarization under a Bayesian framework. An EM algorithm with a generative model is developed to produce representative frames. Note that key frames can be transformed into a skim by joining up the segments that enclose them, and vice versa.

In contrast to key frames, an advantage of video skims is that signals in other modalities, such as audio, can be included. Furthermore, a skim preserves the time-evolving nature of the original video, making it more interesting and impressive. Video saliency is necessary for summarization to produce a representative skim. For static images, Ma et al. [22] calculated visual feature contrast as saliency, computing a normalized saliency value for each pixel. To evaluate the saliency of a video sequence, multi-modal features such as motion vectors and audio frequency should be considered [11], [16], [19]. Ma et al. [11] presented a generic framework of user attention modeling through multiple sensory perceptions. Visual and aural attention are fused into an attention curve, based on which key frames and video skims are extracted around the crests. Recently, You et al. [19] also introduced a method for human perception analysis that combines motion, contrast, special scenes, and statistical rhythm cues. They constructed a perception curve for labeling a three-level summary, namely, keywords, key frames, and video skim.

Various mechanisms have been used to generate video skims. Nam et al. [12] proposed to adaptively sample the video with a sampling rate based on visual activity. Semantically meaningful summaries are achieved through an event-oriented abstraction. By measuring shots' visual complexity and analyzing speech data, Sundaram et al. [17] generated audio-visual skims with constrained utility maximization that maximizes information content and coherence. Since summarization can be viewed as a dimension reduction problem, Gong and Liu proposed to summarize video by using singular value decomposition (SVD) [9]. The SVD properties they derived help to output a skim of user-specified length. Another method by Gong [8] produces a video summary by minimizing the visual content redundancy of the input video. Previous viewers' browsing logs can also assist future viewers: Yu et al.'s method [20] learns user understanding of video content, constructing a ShotRank to measure the importance of video shots; the top-ranking shots are chosen as the video skim.

Some techniques for generating video skims are domain-dependent. For example, Babaguchi [7] presented an approach for abstracting soccer game videos by highlights. Using event-based indexing, an abstracted video clip is automatically created based on the impact factors of events. Soccer events can be detected by using temporal logic models [23] or goalmouth detection [24]. Much attention has been paid to rush video summarization [25]–[27]. Rush videos often contain redundant and repetitive content, by exploiting which a concise summary can be generated. The methods in [15] and [18] focus on summarizing music videos via the analysis of audio, visual, and text information. The summary is generated based on the alignment of the boundaries of the chorus, shot class, and repeated lyrics of the music video.
Fig. 1. Overview of our multi-view video summarization method.

Besides, automatic music summarization has been considered in [28].

Graph models have also been used for video summarization. Lu et al. [10] developed a graph optimization method that computes the optimal video skim in each scene via dynamic programming. Ngo et al. [13] used temporal graph analysis to effectively encapsulate information for video structure and highlights. Through modeling the video evolution by a temporal graph, their method can automatically detect scene changes and generate summaries. Lee et al. [29] presented a scenario-based dynamic video abstraction method using graph matching. Multi-level scenarios generated by a graph-based video segmentation and a hierarchical segmentation are used to segment a video into shots. Dynamic video abstractions are accomplished by accessing the hierarchy level-by-level. Another graph-based video summarization method is given by Peng and Ngo [14]. Highlighted events can be detected by a graph clustering algorithm, incorporating an effective similarity metric of video clips. Compared with their methods, we focus on multi-view videos. Due to content correlations among multiple views, the spatio-temporal shot graph we construct has more complicated node connections, making summarization challenging.

The above methods provide many effective solutions to mono-view video summarization. However, to the best of our knowledge, few methods are dedicated to multi-view video summarization. Multi-view video coding (MVC) algorithms [30]–[32] also deal with multi-view videos. Using techniques such as motion estimation, disparity estimation, and so on, MVC removes information redundancy in the spatial and temporal domains. The video content is, however, unchanged. Therefore, MVC cannot remove redundancy at the semantic level. In contrast, our multi-view video summarization method makes an effort to pave the way for this, by exploring the content correlations among multi-view video shots and selecting the most representative shots for the summary.

III. OVERVIEW

We construct a spatio-temporal shot graph to represent the multi-view videos. Multi-view summarization is achieved through event-centered shot clustering via random walks and multi-objective optimization. Spatio-temporal shot graph construction and multi-view summarization are the two key components. An overview of our method is shown in Fig. 1.

To construct the shot graph, we first parse the input multi-view videos into content-consistent video shots. Dynamic and important static shots are retained as a result. The preserved shots are used as graph nodes, and the corresponding shot importance values are used as node values. For evaluating the importance, a Gaussian entropy fusion model is developed to fuse together a set of intrinsic video features. The multi-view shots usually have diverse correlations with different attributes, such as temporal adjacency and content similarity. We use a hypergraph to systematically characterize the correlations among shots. A hypergraph is a graph in which an edge, usually named a hyperedge, can link a subset of nodes. Each kind of correlation among multi-view shots is thus represented by a kind of hyperedge in the hypergraph. The hypergraph is further converted into a spatio-temporal shot graph where correlations of shots in each view and across multiple views are mapped to edge weights.

To implement multi-view summarization on the spatio-temporal graph, we employ random walks to cluster the event-centered similar shots. Using them as anchor points, the final summarized multi-view shots are generated by a multi-objective optimization model that supports different user requirements as well as multi-level summarization.

We use the multi-view video storyboard and the event board to represent the multi-view summaries. The multi-view storyboard demonstrates the event-centered summarized shots in a multi-view manner, as shown in Fig. 5. In contrast, the event board shown in Fig. 6 assembles the summarized shots along the timeline.

IV. SPATIO-TEMPORAL SHOT GRAPH

It is difficult to directly generate a summarization, especially video skims, from multi-view videos. A common idea is to first parse the videos into shots. In this way, video summarization is transformed into a problem of selecting a set of representative shots. Obviously, the selected shots should favor interesting events. Meanwhile, these shots should be nontrivial. To achieve this, content correlations as well as disparities among shots are taken into account. In previous methods for mono-view video summarization, each shot only correlates with its similar shots along the temporal axis. The correlations are simple and easily modeled. However, for multi-view videos, each shot correlates closely not only with the temporally adjacent shots in its own view but also with the spatially neighboring shots in other views. Relationships among shots increase exponentially relative to the mono-view case, and the correlations are thus very complicated. To better explore such correlations, we consider them with different attributes, for instance, temporal adjacency, content similarity, and high-level semantic correlation, separately. A hypergraph is initially introduced to systematically model the correlations, in which each graph node denotes a shot resulting from video parsing, while each type of hyperedge characterizes one relationship among shots. We then transform the hypergraph into a weighted spatio-temporal shot graph. The weights on graph edges thus qualitatively evaluate the correlations among multi-view shots.
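To make this conversion concrete, the following minimal sketch (our illustration, not the authors' implementation) represents each correlation type as a typed hyperedge over shots and expands the hyperedges into pairwise edge weights of an undirected graph. The example hyperedges, their strengths, and the additive expansion rule are all assumptions; in the actual method, the strengths come from the similarity measures described in Section IV-C.

```python
from collections import defaultdict
from itertools import combinations

# A hyperedge links a subset of shots (identified here by (view, index))
# and carries one correlation type with a strength in [0, 1].
# The concrete values below are illustrative assumptions.
hyperedges = [
    ("temporal", {("v1", 0), ("v1", 1)}, 0.9),            # adjacent shots, view 1
    ("visual",   {("v1", 1), ("v2", 0), ("v3", 2)}, 0.7),  # similar content across views
    ("semantic", {("v2", 0), ("v3", 2)}, 1.0),             # same event, user-specified
]

def hypergraph_to_graph(hyperedges):
    """Expand typed hyperedges into pairwise edge weights W of an
    undirected spatio-temporal shot graph (accumulated additively)."""
    W = defaultdict(float)
    for _etype, nodes, strength in hyperedges:
        for u, v in combinations(sorted(nodes), 2):
            W[(u, v)] += strength
    return dict(W)

def prune(W, threshold=0.05):
    """Drop negligible edges, as the paper does with a predefined threshold."""
    return {e: w for e, w in W.items() if w >= threshold}

if __name__ == "__main__":
    W = prune(hypergraph_to_graph(hyperedges))
    for (u, v), w in sorted(W.items()):
        print(f"{u} -- {v}: weight {w:.2f}")
```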
Fig. 2. Spatio-temporal shot graph G(V, E, W). Each node in G represents a shot, and its value is the shot importance. Each edge connects a pair of nodes (shots) with a correlation which is evaluated by the shots' similarity. Without losing generality, only three shots in each view are given for illustration. To expose the graph clearly, the edges in the graph are drawn in several colors, and each shot is represented as an orange segment.

A. Graph Construction

We first parse the multi-view videos into shots. Various algorithms have been proposed for shot detection [33]–[37]. In [34], Ngo et al. proposed an approach based on the analysis of slices extracted by partitioning the video and collecting temporal signatures. It has proven effective in detecting camera breaks such as cuts, wipes, and dissolves. Xiang et al. [36] used a cumulative multi-event histogram over time to represent video content. An online segmentation algorithm named forward-backward relevance is developed to detect breaks in video content. For multi-view videos, especially surveillance videos, the cameras remain nearly stable, and the videos recorded contain the same scene in most cases. A shot mainly contains temporally contiguous frames which share the same semantic concept with relatively high probability. To detect the shots, we basically adopt the algorithm proposed in [36], and further discard the shots with low activity. In particular, for every shot detected, we first compute the differential image sequence of adjacent frames. Each differential image is then converted into a binary image by comparing the absolute value of each pixel against a threshold. We compute for each shot a normalized activity value by counting the total number of its nonzero pixels and dividing it by the product of the frame number and the frame resolution. We sort the activity values of all shots and select the activity threshold interactively.
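The activity measure just described reduces to a few array operations. The sketch below is a minimal NumPy rendition under stated assumptions: frames are grayscale, and the per-pixel threshold (here 15) stands in for a value that, like the activity threshold itself, the paper selects interactively.

```python
import numpy as np

def shot_activity(frames, pixel_threshold=15):
    """Normalized activity of a shot: binarize absolute frame differences,
    count nonzero pixels, and divide by (number of difference frames) x
    (frame resolution). `frames` has shape (n, H, W), grayscale.
    The pixel threshold value is an assumption."""
    frames = np.asarray(frames, dtype=np.int16)      # avoid uint8 wraparound
    diffs = np.abs(np.diff(frames, axis=0))          # differential image sequence
    binary = diffs > pixel_threshold                 # per-pixel binarization
    n_diff, h, w = binary.shape
    return binary.sum() / float(n_diff * h * w)      # normalized activity value

# Example: a static shot scores near 0; a dynamic/noisy one scores much higher.
static = np.tile(np.full((120, 160), 100, dtype=np.uint8), (30, 1, 1))
dynamic = np.random.randint(0, 256, size=(30, 120, 160), dtype=np.uint8)
print(shot_activity(static), shot_activity(dynamic))
```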
Parsing the multi-view videos into shots allows us to seek the summarization solution in a more compact shot space. Each shot correlates with the similar shots in its own view as well as those in other views. This characteristic makes a weighted graph a suitable representation of multi-view videos, by viewing shots as nodes and converting the correlations between shots into edge weights. We extend the graph models for mono-view video summarization [10], [13], [14] and segmentation [38] to a spatio-temporal shot graph. The connectivity of the graph we construct is inherently complicated due to the spatio-temporal correlations among multi-view shots.

The multi-view videos are treated as a weighted undirected shot graph G(V, E, W), as illustrated in Fig. 2. Each node in V represents a shot resulting from video parsing. Its value is the importance of the shot calculated by the Gaussian entropy fusion model. The edge set E connects every pair of nodes if they are closely correlated. The edge weight W measures node similarity by taking into account their correlations in terms of different attributes. We model such correlations among shots with a hypergraph in which each type of hyperedge denotes a kind of correlation. By converting the hypergraph into the spatio-temporal graph, the edge weights quantitatively evaluate the correlations among shots. Note that the shot graph is called a spatio-temporal graph in the sense that it embeds the scene information coming from different spatial views. The “spatio-temporal” here differs from its traditional definition on a monocular video sequence.

By representing the multi-view videos as the spatio-temporal shot graph, correlations among shots are naturally and intuitively reflected in the graph. Moreover, the graph nodes carry shot importance, which is necessary to create a concise and representative summary. We describe the Gaussian entropy fusion model and the hypergraph in Sections IV-B and IV-C, respectively.

B. Shot Importance Computation

By representing multi-view videos with a graph, multi-view video summarization is converted into the task of selecting the most representative video shots. The selection of representative shots often varies from person to person. In this sense, detecting representative shots generally involves understanding video content based on human perception and is very difficult. To make it computationally tractable, we instead quantitatively evaluate the shot importance by considering low-level image features as well as high-level semantics. We introduce a Gaussian entropy fusion model to fuse a set of low-level features, such as color histograms and wavelet coefficients, and compute an importance score. For high-level semantics, we currently consider mainly human faces. Moreover, we take into account the interesting events of specific types of videos, since video summarization is often domain-specific.

1) Shot Importance by Low-Level Features: We develop a Gaussian entropy fusion model to measure shot information by integrating low-level features. In contrast, previous mono-view video summarization methods generally combine features with linear or nonlinear fusion schemes. Such schemes would not necessarily lead to optimal performance for our multi-view videos when the videos are contaminated by noise. This is especially true for multi-view surveillance videos, which often suffer from different lighting conditions across multiple views. Under such circumstances, we should robustly and fairly evaluate the importance of the shots that may capture the same interesting event in multiple views under different illuminations. To account for this, we need to emphasize the portion of shot-related useful information in multi-view videos, and suppress the influence of noise simultaneously. Based upon this observation, we
first extract from the videos a set of intrinsic low-level features, which are often correlated with each other.

We mainly take into account visual features: the color histogram feature, the edge histogram feature, and the wavelet feature [9], [39], [40]. Features in other modalities, such as the textual and aural features used in previous video analysis methods [11], [19], can also be integrated into our method. Without losing generality, for a shot $S$ with $n$ frames, suppose that overall $M$ feature vector sets are extracted. We expand each feature $F^i$ into a one-column vector. Two arbitrary features $F^i$ and $F^j$ may have different dimensions. We denote the feature sets by $\{F^i\}_{i=1}^{M}$.

The feature sets contain shot-related useful information. Besides, they are often contaminated by noise. The Gaussian entropy fusion model aims at emphasizing the useful information of the feature sets while simultaneously minimizing the influence of noise. We can relatively safely assume that different features have uncorrelated noise characteristics. The interaction of the feature sets is the shot-related information, expressed as

$$\mathcal{I}(S) \approx \sum_{i=1}^{M} \mathcal{I}(F^i) - \mathcal{I}(F^1, F^2, \ldots, F^M) \quad (1)$$

where $\mathcal{I}(\cdot)$ denotes the amount of information. In the above formula, the importance of shot $S$ is measured by adding up the information amounts of the individual features and subtracting the information amount of their union. Since the noise contained in different features is uncorrelated, the formula weakens the influence of noise while the useful information is emphasized. According to information theory, a measure of the amount of information is entropy. We therefore add up the entropy values of all feature sets and subtract the entropy of their union from the sum:

$$H(S) = \sum_{i=1}^{M} H(F^i) - H(F^1, F^2, \ldots, F^M) \quad (2)$$

where $F^i = (f_{i1}, f_{i2}, \ldots, f_{in})^{T}$, $f_{ij}$ is the $i$th feature set for the $j$th frame of shot $S$, and $H(\cdot)$ denotes the entropy of a feature. To estimate the probabilities $p(F^i)$ and $p(F^1, F^2, \ldots, F^M)$, a common idea is to approximate them with Gaussian distributions:

$$p(F^i) \sim \mathcal{N}(0, \Sigma_i) \quad (3)$$

$$p(F^1, F^2, \ldots, F^M) \sim \mathcal{N}(0, \Sigma) \quad (4)$$

where $\Sigma_i$ is the covariance matrix of $F^i$ ($i = 1, \ldots, M$), and $\Sigma$ is that of $(F^1, \ldots, F^M)$. The features are normalized to zero mean:

$$f_{ij} \leftarrow f_{ij} - \frac{1}{n}\sum_{k=1}^{n} f_{ik} \quad (5)$$

By virtue of nonlinear time series analysis [41], the Gaussian entropy of shot $S$ is finally expressed as

$$H(S) = C + \frac{1}{2}\sum_{j}\log_2\left(\Sigma_{jj}\right) - \frac{1}{2}\sum_{l}\log_2\left(\lambda_l\right) \quad (6)$$

where $\Sigma_{jj}$ is the $j$th element on the diagonal of matrix $\Sigma$, $\lambda_l$ is an eigenvalue of $\Sigma$, and $C$ is a constant.

The entropy $H(S)$ is a measure of the information encoded by the shot $S$. We take it as the importance. An additional advantage of the Gaussian entropy fusion scheme is that it works well as long as the union of the feature vector groups covers most of the useful information of the multi-view videos. Therefore, instead of using all the feature sets, it is sufficient if some well-defined feature sets are available. (A numerical sketch of this computation is given at the end of this subsection.)

2) Shot Importance by High-Level Semantics: Humans are usually important content in a video sequence. We employ the Viola-Jones face detector [42] to detect faces in each frame. In addition, video summarization is often domain-specific. The definition of shot importance may vary according to different video genres. For instance, in a baseball game video, the shots that contain a "home run", "catch", or "hit" usually attract much user attention. Many methods have been suggested to detect interesting events in specific types of videos, such as abnormality detection [43] in surveillance video, excitement and interestingness detection in sports video [7], [23], [24], brilliant music detection [28], and so on. A detailed description of these methods is beyond the scope of this paper. However, for specific types of multi-view videos, interesting event detection can be integrated into our method. For shots that contain faces in most frames or contain interesting events, the importance scores are set to 1.
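The fusion in (2)–(6) can be sketched numerically as follows. This is our illustration under the Gaussian assumption of (3) and (4), not the authors' code; the ridge term added for numerical stability and the toy feature dimensions are assumptions. For two feature sets, the difference in (2) is exactly their mutual information, which retains the shared, shot-related signal while the independent noise cancels.

```python
import numpy as np

def gaussian_entropy(X, eps=1e-6):
    """Differential entropy (in bits, up to the constant absorbed into C in (6))
    of zero-mean rows X under a Gaussian model: 0.5 * sum_l log2(lambda_l),
    where lambda_l are eigenvalues of the covariance. `eps` is a small ridge
    for numerical conditioning (an assumption, not from the paper)."""
    cov = np.cov(X, rowvar=False) + eps * np.eye(X.shape[1])
    eigvals = np.linalg.eigvalsh(cov)
    return 0.5 * np.sum(np.log2(eigvals))

def shot_importance(feature_sets):
    """H(S) = sum_i H(F^i) - H(F^1, ..., F^M), as in (2).
    `feature_sets`: list of (n_frames, d_i) arrays, one per feature type
    (e.g., color histogram, edge histogram, wavelet coefficients)."""
    feature_sets = [F - F.mean(axis=0) for F in feature_sets]  # zero mean, cf. (5)
    joint = np.hstack(feature_sets)                            # union of the features
    return sum(gaussian_entropy(F) for F in feature_sets) - gaussian_entropy(joint)

# Toy check: a shared signal plus independent noise in both feature sets.
rng = np.random.default_rng(0)
n = 200
signal = rng.normal(size=(n, 1))
F1 = np.hstack([signal, rng.normal(size=(n, 2))])  # "color" features
F2 = np.hstack([signal, rng.normal(size=(n, 2))])  # "edge" features
print(shot_importance([F1, F2]))  # more shared information -> larger score
```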
2) Shot Importance by High-Level Semantics: Humans are usually important content in a video sequence. We employ the Viola-Jones face detector [42] to detect faces in each frame. In addition, video summarization is often domain-specific, and the definition of shot importance may vary across video genres. For instance, in a baseball game video, the shots that contain a "home run", "catch", or "hit" usually attract much user attention. Many methods have been proposed to detect interesting events in specific types of video, such as abnormality detection in surveillance video [43], excitement and interestingness detection in sports video [7], [23], [24], brilliant music detection [28], and so on. A detailed description of these methods is beyond the scope of this paper; however, for a specific type of multi-view video, interesting event detection can be integrated into our method. For those shots that contain faces in most frames, or that contain interesting events, the importance scores are set to 1.

C. Correlations Among Shots by Hypergraph

Basically, three kinds of relationships among multi-view shots are considered:
• Temporal adjacency. Two shots are likely to describe the same event if one shot is temporally adjacent to the other.
• Visual similarity. Two shots are related to each other if they are visually similar.
• Semantic correlation. Two shots may correlate with each other because the same event or semantic object, such as a face, occurs in both shots.

Temporal adjacency implies that adjacent video shots share the same semantic concepts with relatively high probability. For two shots $S_i$ and $S_j$, the temporal similarity is defined as

$$W_t(S_i, S_j) = \frac{1}{\sigma_1 + \sigma_2 d + \sigma_3 d^2} \qquad (7)$$

where $d = |t_i - t_j|$ computes the temporal distance, and $t_i$ and $t_j$ are the time stamps of the two shots' middle frames. The distance $d$ is integrated into a light attenuation function whose parameters $\sigma_1$, $\sigma_2$, and $\sigma_3$ control the temporal similarity. We set their values as follows. Given the training shots parsed from a mono-view video, we first compute the temporal similarities of each shot pair with initial values, and we then adjust the values until the computed temporal similarities accord with our observation. Through experiments, $\sigma_1$, $\sigma_2$, and $\sigma_3$ are set to 1, 0.01, and 0.0001, respectively. Different settings of the three parameters have the same effect on summarization as long as their relative magnitudes are kept invariant. We set the other parameters in the similarity computation in a similar way.

Visual similarity is computed by

$$W_v(S_i, S_j) = e^{-k_v \cdot \mathrm{VisSim}(S_i, S_j)} \qquad (8)$$
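The two similarity measures can be sketched as follows. The $1/(\sigma_1 + \sigma_2 d + \sigma_3 d^2)$ form is our reading of the "light attenuation function" above, and `vis_dist` stands for the VisSim distance of (9), defined in the next subsection; the function names are illustrative.

```python
import numpy as np

# Parameters from the paper: sigma1, sigma2, sigma3 = 1, 0.01, 0.0001
# control the attenuation falloff of (7); k_v = 0.1 controls (8).
SIGMA = (1.0, 0.01, 0.0001)
K_V = 0.1

def temporal_similarity(t_i, t_j, sigma=SIGMA):
    """Eq. (7): attenuate similarity with the distance between the
    time stamps of the two shots' middle frames."""
    d = abs(t_i - t_j)
    s1, s2, s3 = sigma
    return 1.0 / (s1 + s2 * d + s3 * d * d)

def visual_similarity(vis_dist, k_v=K_V):
    """Eq. (8): map the histogram-based distance VisSim of (9),
    precomputed for the two shots, to a similarity value."""
    return float(np.exp(-k_v * vis_dist))
```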
where $k_v$ is a control parameter set to 0.1. For computational efficiency, we select three frames, namely the first, middle, and last, from $S_i$ and $S_j$ separately, and we calculate the visual similarity according to the distances of their color histograms $H_C$ and edge histograms $H_E$ [40]:

$$\mathrm{VisSim}(S_i, S_j) = w \cdot \|H_C(S_i) - H_C(S_j)\| + (1 - w) \cdot \|H_E(S_i) - H_E(S_j)\|. \qquad (9)$$

The use of the edge histogram weakens the influence of lighting differences across multi-view shots. Here $w$ is a weight that is empirically set to 0.5. For specific domains, the measure could be modified to accommodate more complex texture information, as well as motion features.

Semantic correlation of video shots is often related to the occurrence of specific events, and it varies with video genre. For instance, for surveillance videos, there is a definite correlation between two shots if the same human face is detected in both. For football game videos, there exists a strong correlation among the shots that record all the goals in a match. Since a comprehensive study of semantic correlation is beyond the scope of this paper, in our current implementation we allow the user to interactively specify semantically correlated video shots. The semantic correlation value $W_s$ of two correlated shots $S_i$ and $S_j$ is set to 1.

To measure the total similarity $W$ of shots $S_i$ and $S_j$, a straightforward way is to fuse the above similarity values together with certain weights and to construct the spatio-temporal graph directly. Such a scheme, however, may destroy the original relationships if the fusion weights cannot be set properly. For instance, two shots with a large temporal distance may visually resemble each other; imagine a person who repeats his actions at a 24-h interval in the same scene. A strong correlation between the two shots should exist. Nevertheless, improper weights would make $W$ too small and negligible. A natural way to remedy this flaw is to represent the correlations among multi-view shots with a hypergraph. A hypergraph is a graph in which an edge can connect more than two nodes. Such an edge is called a hyperedge [44] and links a subset of nodes. Obviously, in this sense, an ordinary graph is a special kind of hypergraph.

In our hypergraph, the nodes represent video shots. To construct the hyperedges, we build an ordinary graph for each relationship, i.e., temporal adjacency, visual similarity, and semantic correlation, and apply a graph clustering algorithm to its nodes. All the nodes in a cluster are then linked by a hyperedge in the hypergraph. Note that two clusters may overlap, since the clustering algorithm is performed separately for each relationship. The weight on a hyperedge is the average of the relation values of all pairs of nodes in the same cluster.

Generally, there are two methods for transforming a hypergraph into a general graph. One directly uses a hypergraph partition algorithm such as the normalized hypergraph cut [45]. The other seeks a solution through clique expansion or star expansion [46]. We employ clique expansion to convert the hypergraph into the spatio-temporal graph. By clique expansion, each hyperedge is expanded into a clique, in which the weight on each pair of nodes is taken as the weight of the hyperedge. On the spatio-temporal shot graph, the edge weight $W$ is the sum of the edge weights derived from the cliques to which the edge belongs. In addition, to further simplify the graph, the edge weight $W$ is set to zero if it is smaller than a predefined threshold.
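A minimal sketch of the clique expansion step, assuming the per-relationship clusters have already been computed and are passed in as (node set, average relation value) pairs; the function name and input format are our own.

```python
from collections import defaultdict
from itertools import combinations

def clique_expand(hyperedges, threshold=0.0):
    """Convert a weighted hypergraph into the ordinary shot graph.

    hyperedges: list of (nodes, weight) pairs, where `nodes` is the
        set of shot indices linked by one hyperedge and `weight` is
        the average pairwise relation value within that cluster.
    Returns a dict mapping an edge (i, j) with i < j to its weight W,
    the sum of the weights of all cliques the edge belongs to.
    """
    W = defaultdict(float)
    for nodes, weight in hyperedges:
        # Each hyperedge becomes a clique; every pair inside the
        # clique inherits the hyperedge weight.
        for i, j in combinations(sorted(nodes), 2):
            W[(i, j)] += weight
    # Simplify the graph: drop edges below the predefined threshold.
    return {e: w for e, w in W.items() if w >= threshold}
```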
V. MULTI-VIEW SUMMARIZATION

The spatio-temporal shot graph is a suitable representation of the multi-view video structure, since it carries shot information and meanwhile intuitively reflects the correlations among shots. Owing to these correlations, the shot graph has complicated connectivity, which makes the summarization task challenging: we must select the most representative graph nodes (shots) while taking into consideration both the connections and the users' requirements. Our basic observation is that, with the shot graph, multi-view video summarization can be formulated as a graph labeling problem. We accomplish this in two steps. We first cluster event-centered similar shots and pick out the candidates for summarization by random walks. The final summary is then generated by a multi-objective optimization process that is specifically devised to accommodate different user requirements.

A. Shot Clustering by Random Walks

To cluster similar shots, we first sample a small number of important shots and then cluster the shots of the same events by random walks. We adopt random walks in this step rather than other graph partition algorithms, such as graph cut, for the following reasons.

On one hand, random walks has proven effective in handling large and complex graphs, even in the presence of conspicuous noise. It is thus suitable for our clustering task, which needs to partition the spatio-temporal shot graph with complicated node connections. Graph cut, however, is prone to the small cut and noise influence [47].

On the other hand, our graph partition is a K-way segmentation problem given sampled shots indicating the seeds of candidate clusters, and the random walks algorithm works well for such problems. The random walker starts from each unsampled node (shot) and determines for it the most preferable sampled shot's cluster. The final clusters thus obtained are actually event-centered. In general, many events can be represented as object activities and interactions exhibiting different motion patterns [48]. For an event captured by multi-view shots, the similarities among the shots in terms of visual, temporal, and semantic correlations should be large. In addition, each event may have at least one central shot with high shot importance, which can be taken as one of the best views recording this event. The random walks-based shot clustering fulfills these requirements in that we select the shots with higher importance as seeded nodes; such shots can be viewed as the centers of events. Furthermore, the edge weight on the graph is defined in terms of the shots' similarities, which makes clustering event-relevant shots possible. Notice that the event-centered property of our clustering also facilitates video retrieval by allowing the user to specify shots of interest as seeds; the final clusters containing the seeds are then the retrieval results.

Although a detailed description of random walks theory is beyond the scope of this paper, it essentially works as follows. First, we partition the node set $V$ into seeded nodes $V_S$ and unseeded nodes $V_U$, satisfying that the importance value of each seed in $V_S$ exceeds an entropy threshold. We then define the combinatorial Laplacian matrix for the graph as

$$L_{ij} = \begin{cases} \sum_{k} W(S_i, S_k), & \text{if } i = j \\ -W(S_i, S_j), & \text{if } S_i, S_j \text{ are linked by an edge} \\ 0, & \text{otherwise.} \end{cases} \qquad (10)$$

$L$ is an $N_n \times N_n$ sparse, symmetric, and positive semi-definite matrix, where $N_n$ is the number of nodes in the graph. We further decompose $L$ into blocks corresponding to the nodes in $V_S$ and $V_U$ separately:

$$L = \begin{bmatrix} L_S & B \\ B^T & L_U \end{bmatrix}. \qquad (11)$$

For each unseeded node, the final determination of the seeded cluster to which it belongs is made by solving

$$L_U X_U = -B^T X_S \qquad (12)$$

where $X_U$ represents the probabilities that the unseeded nodes belong to the seeded nodes' clusters, and $X_S$ denotes the matrix that marks the cluster categories of the seeded nodes. We use a conjugate gradient algorithm [49] to solve this linear formulation of random walks, which runs very fast.

In the end, to favor important events with long duration, we filter out trivial shot clusters with low entropy values. Furthermore, two clusters whose similarity exceeds a given threshold are merged together. The remaining shot clusters are used as candidates for summarization in the multi-objective optimization (Fig. 3).

Fig. 3. Graph partition by random walks. Shot clusters generated by random walks are enclosed by dashed circles.
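The clustering step can be sketched with SciPy as below, assuming each seed starts its own cluster and the edge weights come from the clique-expanded graph; this is an illustrative implementation of (10)-(12), not the authors' code.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import cg

def random_walk_clusters(W, seeds):
    """Assign each unseeded shot to a seed's cluster via (10)-(12).

    W: dense (N, N) symmetric array of edge weights.
    seeds: list of node indices used as cluster seeds.
    Returns an array of cluster labels, one per node.
    """
    N = W.shape[0]
    L = np.diag(W.sum(axis=1)) - W              # combinatorial Laplacian (10)
    unseeded = [i for i in range(N) if i not in set(seeds)]

    # Block decomposition of L as in (11).
    L_U = csr_matrix(L[np.ix_(unseeded, unseeded)])
    B = L[np.ix_(seeds, unseeded)]

    K = len(seeds)
    probs = np.zeros((len(unseeded), K))
    for k in range(K):
        x_s = np.zeros(K)
        x_s[k] = 1.0                            # indicator column of X_S
        rhs = -B.T @ x_s                        # right-hand side of (12)
        probs[:, k], _ = cg(L_U, rhs)           # conjugate gradient solve

    labels = np.empty(N, dtype=int)
    for k, s in enumerate(seeds):
        labels[s] = k
    labels[unseeded] = probs.argmax(axis=1)     # most probable seed cluster
    return labels
```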
B. Multi-Objective Optimization

Users normally have various requirements on summarization, according to different application scenarios. In general, a good summary should achieve the following goals simultaneously. 1) Minimize the shot number: the retrieval application of a summary requires that a small number of shots be generated. 2) Minimize the summary length: a minimum-length summary greatly helps video storage. 3) Maximize information coverage: to achieve enough information coverage, the sum of the resulting shots' entropy values in each cluster must exceed a certain threshold. 4) Maximize shot correlation: it is much better if the shots in every resulting cluster strongly correlate with each other, which yields the most representative shots for the interesting event.

To meet the above requirements, we design a multi-objective optimization model to generate the final summary. The optimization follows the complexity incompatibility principle [50]. We formulate the summarization as a graph labeling problem. For a shot cluster $C_s$ with $n_s$ shots, the decision whether or not the shots should be in the summary is denoted by $x = (x_1, x_2, \ldots, x_{n_s})$, $x \in X$, where $X$ is the 0/1 solution space in which $x_i = 1$ stands for a reserved shot and $x_i = 0$ for an unreserved one (Fig. 4).

Fig. 4. Final video summary resulting from optimization. Dashed lines connect the shots that are reserved in the same shot cluster.

The multi-objective optimization function is given by

$$\max \{-f_1(x), -f_2(x), f_3(x), f_4(x)\} \quad \text{s.t.} \quad g(x) \le D_{max},\; h(x) \ge R_{min} \qquad (13)$$

where $f_1(x) = \sum_{i=1}^{n_s} x_i$, $f_2(x) = \sum_{i=1}^{n_s} D_i x_i$ with $D_i > 0$, $f_3(x) = \sum_{i=1}^{n_s} R_i x_i$, and $f_4(x) = \frac{1}{2}\sum_{i,j=1}^{n_s} W(S_i, S_j)\, x_i x_j$. Here $f_1$, $f_2$, $f_3$, and $f_4$ denote the total shot number, the summary length, the information coverage, and the shot correlation within the cluster, respectively. $D_i$ and $R_i$ are the length and importance of shot $i$, respectively. $g(x)$ and $h(x)$ are defined in terms of fuzzy sets [51]:

$$g(x) = \mu(f_2(x)), \quad h(x) = \mu(f_3(x))$$

with $\mu(f(x)) = [f(x) - \inf f(x)]/[\sup f(x) - \inf f(x)]$. $D_{max}$ is the maximum allocated length of one cluster, and $R_{min}$ is the minimum information entropy of $C_s$. They are defined as

$$D_{max} = \lambda_1 \cdot D, \quad R_{min} = \lambda_2 \cdot R$$

where $D$ and $R$ are the total length of the shots in $C_s$ and the sum of their importance values, respectively. $\lambda_1$ and $\lambda_2$ are the parameters that control the summary granularity. The two constraints mean that the total length of the shots in $C_s$ after optimization should be less than $D_{max}$, whereas their entropy should be greater than $R_{min}$. We will show in the experiments that, by flexibly configuring $\lambda_1$ and $\lambda_2$, multi-level summarization can be easily achieved.

We further define the minimum function

$$\mu(F(x)) = \min_{1 \le i \le 4} \{\eta_i\, \mu(f_i(x))\} \qquad (14)$$

in which $F(x) = (\mu(f_1(x)), \mu(f_2(x)), \mu(f_3(x)), \mu(f_4(x)))^T$, and $\eta_i$, $i = 1, \ldots, 4$, are coefficients that control the weights of the objective functions, satisfying $\sum_{i=1}^{4} \eta_i = 1$ and $\eta_i \ge 0$. They can be configured according to different user requirements.
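The fuzzy objective of (13)-(14) can be sketched as follows. The analytic inf/sup bounds used to normalize each objective over the solution space are simplifying assumptions we introduce for illustration, and the weights eta default to uniform values.

```python
import numpy as np

def fuzzy_objective(x, D, R, W, eta=(0.25, 0.25, 0.25, 0.25)):
    """Evaluate the min-aggregated objective (14) for one 0/1 labeling x.

    x: 0/1 array over the n_s shots of a cluster.
    D, R: per-shot lengths and importance values.
    W: (n_s, n_s) intra-cluster similarity matrix.
    eta: weights of the four objectives, summing to 1.
    """
    f = np.array([
        -x.sum(),              # f1: total shot number (minimized)
        -(D * x).sum(),        # f2: summary length (minimized)
        (R * x).sum(),         # f3: information coverage
        0.5 * x @ W @ x,       # f4: intra-cluster shot correlation
    ])
    # Normalize each objective to [0, 1] with its bounds over the
    # 0/1 solution space, mimicking the fuzzy membership mu.
    lo = np.array([-len(x), -D.sum(), 0.0, 0.0])
    hi = np.array([0.0, 0.0, R.sum(), 0.5 * W.sum()])
    mu = (f - lo) / np.maximum(hi - lo, 1e-12)
    return min(e * m for e, m in zip(eta, mu))   # eq. (14)
```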
By employing the Max-Min method, the multi-objective optimization is transformed into the following 0-1 mixed integer programming problem:

$$x^* = \arg\max_{x \in X} \mu(F(x)) \quad \text{s.t.} \quad A \cdot F(x) \le \begin{bmatrix} D_{max} \\ -R_{min} \end{bmatrix} \qquad (15)$$

with $A$ a 0/$\pm$1 matrix whose rows select the length and coverage objectives so that the two inequalities reproduce the constraints $g(x) \le D_{max}$ and $-h(x) \le -R_{min}$ of (13). $x^*$ is the final optimization result to be solved for.

This integer programming is a typical knapsack problem in combinatorial optimization. We use a pseudo-polynomial time dynamic programming algorithm [52] to solve it. The algorithm runs fast in all of our experiments.
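As an illustration only, the following brute-force reference solver enumerates the 0/1 labelings of one small cluster and keeps the feasible labeling that maximizes (14). It is not the pseudo-polynomial dynamic program of [52], which would be preferred for larger clusters; it reuses fuzzy_objective from the sketch above.

```python
from itertools import product
import numpy as np

def solve_cluster(D, R, W, lam1, lam2, eta=(0.25, 0.25, 0.25, 0.25)):
    """Pick the summary shots of one small cluster by maximizing (14)
    subject to the length and coverage constraints of (13)/(15)."""
    n = len(D)
    D_max, R_min = lam1 * D.sum(), lam2 * R.sum()
    best_x, best_val = None, -np.inf
    for bits in product((0, 1), repeat=n):
        x = np.array(bits, dtype=float)
        if (D * x).sum() > D_max or (R * x).sum() < R_min:
            continue                     # violates the constraints of (15)
        val = fuzzy_objective(x, D, R, W, eta)
        if val > best_val:
            best_x, best_val = x, val
    return best_x                        # None if the cluster is infeasible
```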
VI. EXPERIMENTS

We conducted experiments on several multi-view videos covering typical indoor and outdoor environments. The office1, campus, office lobby, and road videos are typical surveillance videos, since surveillance video is one of the most important multi-view video types. Some of the multi-view videos are semi-synchronous or nonsynchronous. Most are captured by three or four ordinary cameras with overall 360-degree coverage of the scene. To further verify our method, we also deliberately shot an outdoor scene with four cameras covering only 180 degrees. Note that all of the videos were captured with web cameras or handheld ordinary video cameras by nonspecialists, making some of them unstable and obscure. Moreover, some videos have quite different brightness across views. These issues pose great challenges to multi-view video summarization.

Table I shows the information on the experimental data. All experimental results were collected on a PC equipped with a P4 3.0-GHz CPU and 1 GB of memory. The multi-view videos as well as the summaries can be found on the demo page http://cs.nju.edu.cn/ywguo/summarization.html. Note that we sacrifice the visual quality of the original multi-view videos to meet the space limitation of online storage by compressing them with high compression ratios.

TABLE I
DETAILS OF MULTI-VIEW VIDEOS AND SUMMARIES

Multi-view Videos | No. of Views | Video Length (Mins.)    | Levels of Summary | Level   | Summary Length (Mins.) | λ2: Info. Reserved (%)
office1           | 4            | 11:16/8:43/11:22/14:58  | 1                 | Level 1 | 1:53                   | 70
campus            | 4            | 15:19/13:51/12:30/15:03 | 1                 | Level 1 | 4:02                   | 60
office lobby      | 3            | 08:14/08:14/08:14       | 2                 | Level 1 | 2:56                   | 60
                  |              |                         |                   | Level 2 | 5:14                   | 70
road              | 3            | 5:11/8:49/8:46          | 2                 | Level 1 | 2:21                   | 60
                  |              |                         |                   | Level 2 | 4:28                   | 70
badminton         | 3            | 5:07/5:00/5:00          | 3                 | Level 1 | 0:50                   | 60
                  |              |                         |                   | Level 2 | 1:08                   | 65
                  |              |                         |                   | Level 3 | 2:08                   | 70

Display of multi-view summary. We employ the multi-view storyboard to represent the multi-view video summary, as illustrated in Fig. 5. The storyboard naturally reflects the spatial and temporal information of the resulting shots as well as their correlations, allowing the user to walk through and analyze the summarized shots in a natural and intuitive way. In the storyboard, each shot in the summary is represented by its middle frame. By clicking on the yellow block highlighted with the corresponding shot number, the user can browse the summarized video shot. Dashed lines connect those shots of the same scene event derived from the random walks clustering and multi-objective optimization.

On the basis of the multi-view storyboard, we further introduce an event-board to display the multi-view summary, as illustrated in Fig. 6. The summarized shots are assembled along the timeline across multi-views. Each shot is represented by a box, and the number in the box indicates the view to which the shot belongs. Dashed blue boxes represent those events that are recorded by more than one shot or by different views. By clicking on the boxes, the shots can be displayed. Through the event-board, we can easily generate a single video summary that includes all the summarized shots; some examples of the single video summary are shown on our demo page. One of its advantages over the storyboard is that it allows rapid browsing of the summarized result. If the user needs to browse the summary within a limited time, the single summary is a good choice.
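A hypothetical sketch of how an event-board could order the summarized shots into a single browsable summary; the Shot record and its field names are our own assumptions for illustration, not part of the paper's system.

```python
from dataclasses import dataclass

@dataclass
class Shot:
    view: int        # camera view the shot comes from
    start: float     # start time (seconds) of the shot
    end: float       # end time (seconds)
    event_id: int    # event cluster the shot was assigned to

def single_video_summary(summary_shots):
    """Order the summarized shots along the timeline, as the
    event-board does, to form one browsable summary playlist."""
    playlist = sorted(summary_shots, key=lambda s: (s.start, s.view))
    # Group consecutive shots of the same event so they can be
    # rendered inside one dashed event box.
    boxes, current = [], []
    for shot in playlist:
        if current and shot.event_id != current[-1].event_id:
            boxes.append(current)
            current = []
        current.append(shot)
    if current:
        boxes.append(current)
    return playlist, boxes
```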
A distinct characteristic of multi-view videos is that events are captured with overlap across multiple views. To generate a compact yet highly informative summary, it is usually important to summarize a certain event only in the most informative view and to avoid repetitive summarization. This is especially true if the user only hopes to obtain a short video summary. Our method realizes this. One example is shown in the summary of the multi-view office1 videos. In the 24th shot, the girl who opened the door and went to her cube is reserved only in the second view, although she appeared in four views simultaneously. The man who opened the door in the 24th shot and left the room in the 25th shot is likewise reserved only in the second view. In this sense, our method can be applied to the selection of optimal views. In addition, the method supports summarizing the same event using temporally successive multi-view shots; the event is then recorded by the shots describing it with the best views over its duration.

On the other hand, it is also reasonable to produce a multi-view summary of the same event. For example, for a traffic accident, the videos from all views are often crucial for responsibility identification and verification. Our method handles this case successfully. In the multi-view office1 videos, three guys intruded into the views and left the room; this action is reserved simultaneously in the 22nd shot of the second view and the 40th shot of the fourth view. Other typical examples are the 28th and 35th shots, the 30th and 46th shots, and the 38th and 49th shots. Such summaries are attributed to two points. First, the shot importance computation algorithm fairly computes the importance of multi-view shots, even in the presence of brightness differences and noise. Second, the summarization method makes the most of the correlations among multi-view shots.

Multi-level summarization can be conveniently achieved by our method: we only need to configure the two parameters $\lambda_1$ and $\lambda_2$ in the multi-objective optimization.
Fig. 5. Multi-view video storyboard. Without loss of generality, the multi-view office1 videos with four views are given for illustration. The blue rectangles denote the original multi-view videos. Each shot in the summary is represented by a yellow box; clicking on it displays the corresponding shot. Each shot in the summary is assigned a number indicating its order among the shots resulting from the video parsing process; we give the numbers for the convenience of further discussion. Dashed lines connect the shots with strong correlations. The middle frames of a few resulting shots, which allow quick browsing of the summary, are also shown.

Fig. 6. The event-board assembles the event-centered summarized shots in temporal order. Each shot is represented by a box, and the number in the box indicates the view to which the shot belongs. Dashed blue boxes represent events recorded by more than one shot or by different views. By clicking on the boxes, the summarized shots can be displayed. Some representative frames, usually the middle frames of the shots, are shown for a quick preview of the summary.
As aforementioned, $\lambda_1$ is integrated into the constraint that controls the total length of the summary, and $\lambda_2$ is used to adjust the information coverage. Increasing $\lambda_1$ and $\lambda_2$ simultaneously generates a longer and meanwhile more informative summary.

The multi-view badminton videos are summarized into three levels, according to the length and information entropy set for the summary. The parameter $\lambda_1$ is set to 0.035, 0.075, and 0.15 on the 1st, 2nd, and 3rd levels, respectively; $\lambda_2$ is set to 0.6, 0.65, and 0.7 accordingly. The high-level summary covers most of the low-level summary, with reasonable disparity due to the different optimization procedures involved. The low-level summary comprises the most highly repeated actions, such as serve, smash, and dead bird; such statistics can be used for badminton training. The high-level summary, in contrast, appends more amazing rallies, e.g., shots 67, 79, 124, 135, and 154 on level 3.

Other examples of multi-level summarization include the office lobby and road videos. We summarize both of them into two levels by setting $\lambda_2$ to 0.6 and 0.7, respectively. In general, videos containing many events with different shot importance values are more suitable for multi-level summarization. For such videos, the low-level summary contains shots that suffice to describe most of the original video events; the high-level compact summary, by contrast, comprises the events that are more active or salient.

Some discussion of the choice of $\lambda_1$ and $\lambda_2$ is in order. Intuitively, $\lambda_2$ controls the importance value of the summary. In our method, shot importance is evaluated by the entropy defined in terms of low-level features and updated by high-level semantics. The total entropy of the shots discarded for their low activity is too small to be taken into account, so we can relatively safely assume that the reserved shots contain most of the information of the multi-view videos. $\lambda_2$ can thus be regarded as the minimum percentage of information to be preserved in the summary. In our implementation, $\lambda_2$ is given by the user. For $\lambda_1$, we try values from 0.05 to 1 with an increment of 0.05 and select the one that ensures a solution of (15) as $\lambda_1$.
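The $\lambda_1$ search just described amounts to a simple grid sweep. A sketch, reusing the illustrative solve_cluster from Section V:

```python
import numpy as np

def select_lambda1(D, R, W, lam2, step=0.05):
    """Sweep lambda_1 over 0.05..1.0 and return the first value for
    which the optimization (15) admits a feasible solution."""
    for lam1 in np.arange(step, 1.0 + 1e-9, step):
        if solve_cluster(D, R, W, lam1, lam2) is not None:
            return lam1
    return 1.0   # fall back to the loosest length budget
```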
Computational complexity of our method depends mainly on the lengths, resolutions, and activities of the multi-view videos. The major cost is spent on video parsing and graph construction, which take about 15 min for the office1 example. In contrast, summarization with the random walks-based clustering and multi-objective optimization is fast: this step takes less than 1 min, since the constructed graph has only about 60 nodes. Video summarization is often used as a post-processing tool, and our method can be accelerated by a high-performance computing system.

A. Comparison With Mono-View Summarization

We compare our method with previous mono-view video summarization methods; the summaries produced by our method and the previous ones are shown on the demo webpage. We implement the video summarization method presented in [11] and apply it to each view of the multi-view office1, campus, and office lobby videos. For each multi-view video, we combine the resulting shots along the timeline to form a single video summary. For a fair comparison, we also use the above method to summarize the single video formed by combining the multi-view videos along the timeline, and generate a dynamic single video summary.

As that summary is extracted around the crests of an attention curve, the method provides no mechanism to remove content redundancy among multiple views. The summaries it produces evidently contain much redundant information: there are significant temporal overlaps among the summarized multi-view shots, and most events are recorded simultaneously in the summaries.

With our multi-view summarization method, such redundancy is largely reduced. Some events are recorded by the most informative summarized shots, while the most important events are reserved in multi-view summaries. Some events that are ignored by the previous method (for instance, the events recorded from the 1st to 5th, 14th to 18th, and 39th to 41st seconds in our office1 single video summary) are reserved by our method. This is determined by our shot clustering algorithm and the multi-objective optimization operated on the spatio-temporal shot graph. This property of our method facilitates generating a short yet highly informative summary.

We also compare our algorithm against a graph-based summarization method. A single video is first formed by combining the multi-view videos along the timeline. We then construct the graph according to the method given in [10]. The final summary is produced by using normalized cut-based event clustering and highlight detection [14]. The normalized cut widely employed by previous methods often suffers from the "small cut" problem. This can be problematic when the method uses a heuristic criterion to select highlights from event clusters as the summary; that is, some important events with short durations are missed. Our method, however, can meet different summarization objectives through the multi-objective optimization: events with much higher importance are reserved in multiple views, while important events with short durations are preserved as well.

To quantitatively compare our method with previous ones, we use precision and recall to measure performance. We invited five graduate students who were unfamiliar with our research to define the ground-truth video summaries. A shot is labeled as a ground-truth shot only if all five agree with each other. For the office1 multi-view videos, 26 shots in total are labeled as ground-truth shots; the ground-truth summary of the campus videos includes 29 shots. The precision and recall scores of the methods are shown in Table II. Accurately controlling the summary lengths is difficult: the summaries of the different methods are all around 50 s, except that the campus summary obtained by the graph method [10], [14] is 109 s. The second/sixth row gives the data computed by applying the method [11] to each view video separately; the third/seventh row is generated by applying it to the single video formed by first combining the views. Generally, for the office1 multi-view videos, the precision scores show that the summaries obtained by each method belong to the ground-truth. In contrast, the precisions of the four methods computed on the campus videos are all around 50%. The campus videos contain many trivial events, and it is challenging to generate an unbiased summary with these methods. The last column of the table indicates that our method is superior to the others in terms of recall, which suggests that our method is more effective at removing content redundancy.

B. User Study

To further evaluate the effectiveness of our method, we have carried out a user study. The aim is to assess the enjoyability