the edge weights can qualitatively measure similarities among shots. We associate the value of a graph node with shot importance computed by a Gaussian entropy fusion scheme.
Such a scheme can calculate the importance of shots in the presence of brightness differences and conspicuous noise, by emphasizing useful information and precluding redundancy among video features. With the graph representation, the final summary is generated through event clustering based on random walks and a multi-objective optimization process.

To the best of our knowledge, this paper presents the first multi-view video summarization method. It has the following features.

• A spatio-temporal shot graph is used for the representation of multi-view videos. Such a representation makes the multi-view summarization problem tractable in light of graph theory. The shot graph is derived from a hypergraph which embeds different correlations among video shots within each view as well as across multiple views.

• Random walks are used to group shots into event-centered clusters, and the final summary is generated by multi-objective optimization. The multi-objective optimization can be flexibly configured to meet different summarization requirements. Additionally, multi-level summaries can easily be achieved by setting different parameters. In contrast, most previous methods can only summarize videos from a single, fixed perspective.

• The multi-view video storyboard and the event-board are presented for representing the multi-view video summary. The storyboard naturally reflects correlations among multi-view summarized shots that describe the same important event. The event-board serially assembles event-centered multi-view shots in temporal order. With the event-board, a single video summary that facilitates quick browsing of the summarized video can be easily generated.

The rest of this paper is organized as follows. We briefly review previous work in Section II. In Section III, we present a high-level overview of our method. The two key components of our method, spatio-temporal shot graph construction and multi-view summarization, are presented in Sections IV and V, respectively. We evaluate our method in Section VI and conclude the paper in Section VII.

II. RELATED WORK

This paper is made possible by many inspirations from previous work on video summarization. A comprehensive review of state-of-the-art video summarization methods can be found in [21]. In general, two basic forms of video summaries exist, i.e., static key frames and dynamic video skims. The former consists of a collection of salient images fetched from the original video sequence, while the latter is composed of the most representative video segments extracted from the video source.

Key frame extraction should take into account the underlying dynamics of video content. DeMenthon et al. [1] regarded a video sequence as a curve in a high-dimensional space. The curve is recursively simplified with a tree structure representation, and the frames corresponding to junctions between curve segments at different tree levels are viewed as key frames. Hanjalic et al. [3] divided a video sequence into clusters and selected optimal ones using an unsupervised procedure for cluster-validity analysis. The centroids of the clusters are chosen as key frames. Li et al. [4] formulated key frame extraction as a rate-distortion min-max optimization problem, whose optimal solution is found by dynamic programming. Orriols et al. [5] addressed summarization under a Bayesian framework, developing an EM algorithm with a generative model to generate representative frames. Note that key frames can be transformed into a skim by joining up the segments that enclose them, and vice versa.
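As a concrete illustration of the cluster-based strategy in [3], the sketch below picks the frame nearest each cluster centroid as a key frame. It is a minimal sketch, not the authors' implementation: the color-histogram features, the plain k-means clustering, and the fixed number of clusters are all illustrative assumptions (in [3], the number of clusters is instead validated by an unsupervised cluster-validity procedure).

```python
# Minimal sketch of cluster-based key frame extraction in the spirit of [3].
# Assumptions (not from the paper): frames are described by RGB color
# histograms, clusters are found with plain k-means, and the number of
# clusters k is fixed rather than chosen by cluster-validity analysis.
import numpy as np

def frame_histogram(frame, bins=8):
    """Concatenated per-channel histogram of an HxWx3 uint8 frame, unit sum."""
    hist = [np.histogram(frame[..., c], bins=bins, range=(0, 255))[0]
            for c in range(3)]
    h = np.concatenate(hist).astype(float)
    return h / h.sum()

def kmeans(features, k, iters=50, seed=0):
    """Plain k-means on row vectors; returns centroids and per-sample labels."""
    rng = np.random.default_rng(seed)
    centroids = features[rng.choice(len(features), k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(features[:, None] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = features[labels == j].mean(axis=0)
    return centroids, labels

def key_frames(frames, k=5):
    """For each cluster, keep the frame closest to the cluster centroid."""
    feats = np.stack([frame_histogram(f) for f in frames])
    centroids, labels = kmeans(feats, k)
    picks = []
    for j in range(k):
        idx = np.where(labels == j)[0]
        if len(idx) == 0:
            continue
        best = idx[np.linalg.norm(feats[idx] - centroids[j], axis=1).argmin()]
        picks.append(int(best))
    return sorted(picks)

# Usage: `frames` is a list of decoded HxWx3 uint8 arrays;
# key_frames(frames, k=5) returns the indices of five representative frames.
```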
In contrast to key frames, an advantage of a video skim is that signals in other modalities, such as audio information, can be included. Furthermore, a skim preserves the time-evolving nature of the original video, making it more interesting and impressive. Video saliency is necessary for summarization to produce a representative skim. For static images, Ma et al. [22] calculated visual feature contrast as saliency, computing a normalized saliency value for each pixel. To evaluate the saliency of a video sequence, multi-modal features such as motion vectors and audio frequency should be considered [11], [16], [19]. Ma et al. [11] presented a generic framework of a user attention model built from multiple sensory perceptions. Visual and aural attentions are fused into an attention curve, based on which key frames and video skims are extracted around the crests (a sketch of this crest-based extraction is given at the end of this section). Recently, You et al. [19] also introduced a method for human perception analysis by combining motion, contrast, special scenes, and statistical rhythm cues. They constructed a perception curve for labeling a three-level summary, namely, keywords, key frames, and video skims.

Various mechanisms have been used to generate video skims. Nam et al. [12] proposed to adaptively sample the video with a visual activity-based sampling rate; semantically meaningful summaries are achieved through an event-oriented abstraction. By measuring shots' visual complexity and analyzing speech data, Sundaram et al. [17] generated audio-visual skims with a constrained utility maximization that maximizes information content and coherence. Since summarization can be viewed as a dimension reduction problem, Gong and Liu proposed to summarize video using singular value decomposition (SVD) [9]; the SVD properties they derived help to output a skim of user-specified length (a sketch of this view is also given below). Another method by Gong [8] produces a video summary by minimizing the visual content redundancy of the input video. Browsing logs of previous viewers can also assist future viewers: Yu et al.'s method [20] learns user understanding of video content and constructs a ShotRank to measure the importance of video shots. The top-ranking shots are chosen as the video skim.

Some techniques for generating video skims are domain-dependent. For example, Babaguchi [7] presented an approach for abstracting soccer game videos by highlights. Using event-based indexing, an abstracted video clip is automatically created based on the impact factors of events. Soccer events can be detected by using temporal logic models [23] or goalmouth detection [24]. Much attention has been paid to rush video summarization [25]–[27]. Rush videos often contain redundant and repetitive content, which can be exploited to generate a concise summary. The methods in [15] and [18] focus on summarizing music videos via the analysis of audio, visual, and text. The summary is generated based on the alignment of the boundaries of the chorus, shot classes, and repeated lyrics of the music video.
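To make the crest-based extraction attributed to [11] concrete, the following is a minimal sketch under stated assumptions: the per-frame visual and aural attention scores are supplied by the caller, the two curves are fused with fixed linear weights, and skim segments are fixed-length windows around the highest crests. None of these specifics come from the original attention model.

```python
# Minimal sketch of crest-based skim extraction in the spirit of [11].
# Assumptions (not from the paper): fixed linear fusion weights, a simple
# moving-average smoother, and fixed-length windows around each crest.
import numpy as np

def attention_curve(visual, aural, w_visual=0.6, w_aural=0.4, win=15):
    """Fuse per-frame visual/aural attention and smooth with a moving average."""
    curve = w_visual * np.asarray(visual) + w_aural * np.asarray(aural)
    kernel = np.ones(win) / win
    return np.convolve(curve, kernel, mode="same")

def crests(curve):
    """Indices of local maxima (crests) of the smoothed attention curve."""
    return [i for i in range(1, len(curve) - 1)
            if curve[i] > curve[i - 1] and curve[i] >= curve[i + 1]]

def skim_segments(curve, half_len=30, budget=5):
    """Fixed-length windows around the highest crests, up to a shot budget.
    Overlapping windows are not merged in this sketch."""
    peaks = sorted(crests(curve), key=lambda i: curve[i], reverse=True)
    segments = [(max(0, i - half_len), min(len(curve), i + half_len))
                for i in peaks[:budget]]
    return sorted(segments)

# Usage: `visual` and `aural` are per-frame attention scores in [0, 1];
# skim_segments(attention_curve(visual, aural)) yields (start, end) frame
# ranges that would be concatenated into the skim.
```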
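The dimension-reduction view behind [9] can be sketched in the same hedged spirit: stack per-frame feature vectors into a matrix, compute its SVD, and rank shots by their energy in the truncated singular-vector space. The rank choice and the energy-based shot score below are assumptions made for illustration; the properties actually derived in [9] are more refined.

```python
# Minimal sketch of SVD-based shot ranking in the spirit of [9].
# Assumptions (not from the paper): frames are feature vectors stored as the
# columns of a (d, n) matrix, shots are frame-index ranges, and a shot's
# score is its mean energy in the rank-r projected space.
import numpy as np

def shot_scores(features, shots, rank=10):
    """features: (d, n) per-frame feature matrix; shots: [(start, end), ...]."""
    # Project frames onto the top-`rank` singular directions (rank <= min(d, n)).
    u, s, vt = np.linalg.svd(features, full_matrices=False)
    reduced = np.diag(s[:rank]) @ vt[:rank]      # (rank, n) frame coordinates
    energy = np.sum(reduced ** 2, axis=0)        # per-frame energy
    return [float(energy[a:b].mean()) for a, b in shots]

def pick_skim(features, shots, target_shots=3, rank=10):
    """Keep the highest-scoring shots, returned in temporal order."""
    scores = shot_scores(features, shots, rank)
    order = np.argsort(scores)[::-1][:target_shots]
    return sorted(shots[i] for i in order)

# Usage: pick_skim(A, [(0, 120), (120, 260), (260, 400)]) returns the frame
# ranges of the shots retained in the skim; a user-specified skim length
# would be enforced by growing `target_shots` until the duration is met.
```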