723 IEEE TRANSACTIONS ON MULTIMEDIA.VOL.12.NO.7.NOVEMBER 2010 where is a control parameter set to 0.1.For computational In addition,to further simplify the graph,the edge weight W is efficiency,we select three frames,namely,the first,middle,and set to zero if it is smaller than a predefined threshold. last,from S;and Si separately and calculate the visual similarity according to their color histogram HC and edge histogram He [40]distances V.MULTI-VIEW SUMMARIZATION The spatio-temporal shot graph is a suitable representation of multi-view video structure.since it carries shot information and VisSim(Si,Sj)=w.Hc(Si)-Hc(SjHE(Si)-HE(Sj). meanwhile reflects intuitively correlations among shots.Due (9) to the correlations among multi-view shots,the shot graph has The use of edge histogram weakens the infuence of lighting dif- complicated connectivity.This makes the summarization task ference across multi-view shots.w here is a weight that is em- challenging.We must generate the most representative graph pirically set to 0.5.For specific domains,W could be modified nodes (shots)by taking into consideration the connections as to accommodate more complex texture information,as well as well as users'requirements.Our basic observation is that,with motion features. the shot graph,the multi-view video summarization can be for- Semantic correlation of video shots is often related to the mulated as a graph labeling problem.We accomplish this in two occurrences of specific events.Besides,it varies with different steps.We first cluster those event-centered similar shots,and video genres.For instance,for surveillance videos,there is a pick out the candidates for summarization by random walks. definite correlation between two shots if the same human face Final summary is generated by a multi-objective optimization is detected in both shots.However,for football game videos, process that is specifically devised for accommodating different there exists a strong correlation among the shots that record all user requirements the goals in a match.Since a comprehensive study of semantic correlation is beyond the scope of this paper,in our current im- A.Shot Clustering by Random Walks plementation,we allow the user to interactively specify seman- tically correlated video shots.Semantic correlation value Ws of To cluster similar shots,we first sample a small number of two correlated shots S;and Si is set to 1. important shots and then cluster the shots of the same events by To measure the total similarity W of shots Si and Sj,a random walks.We adopt random walks in this step rather than straightforward way is to fuse together the above similarity other graph partition algorithms such as graph cut because of values with certain weights and to construct the spatio-tem- the following reasons. poral graph directly.Such a scheme,however,may destroy On one hand,random walks hs proven to be effective in han- the original relationship once the fusion weights could not dling large and complex graphs,even in the presence of con- be set properly.For instance,two shots with large temporal spicuous noises.It is thus suitable for our clustering task which distance visually resemble each other,imaging a people who needs to partition the spatio-temporal shot graph with compli- repeats his actions at a 24-h interval in the same scene.A strong cated node connections.Graph cut,however,is prone to the correlation between the two shots should exist.Nevertheless. small cut and noise influence [47. improper weights will make W too small and negligible.A On the other hand,our graph partition is a K-way segmenta- natural way to remedy the flaw occurring above is to represent tion problem given sampled shots indicating seeds for candidate the correlations among multi-view shots as a hypergraph.A clusters.Random walks algorithm works well for such problem. hypergraph is a graph in which an edge can connect more than The random walker starts from each unsampled node (shot)and two nodes.This edge is named as a hyperedge [44]which often determines for it the most preferable sampled shot's cluster.The links a subset of nodes.Obviously,in this sense,an ordinary final clusters thus obtained are actually event-centered.In gen- graph is a special kind of hypergraph. eral,many events can be represented as object activities and in- In our hypergraph,the nodes just represent video shots.To teractions,showing different motion patterns [48].For the event construct the hyperedges,we build for each relationship,i.e., captured by multi-view shots,similarities among shots in terms temporal adjacency,visual similarity as well as semantic corre-of visual,temporal,as well as semantic correlations should be lation,an ordinary graph and apply graph clustering algorithm large.In addition,each event may have at least a central shot to the nodes.All the nodes in a cluster are then linked by a hy- which has a high shot importance.We can take it as one of the peredge in the hypergraph.Note that two cluster may overlap as best views recording this event.The random walks-based shot for each relationship the clustering algorithm is performed.The clustering fulfills these requirements in that we select the shots weight on the hyperedge is the average of relation values of all with higher importance as seeded nodes.Such shots just can pairs of nodes in the same cluster. be viewed as the centers of events.Furthermore.the weight on Generally,there are two methods to transform a hypergraph graph is defined in form of shots'similarities which makes clus- into a general graph.One is directly using the hypergraph par- tering event relevant shots possible.Notice that the property of tition algorithm such as normalized hypergraph cut [45].The our event-centered clustering also facilitates video retrieval by other seeks solution through clique expansion or star expansion allowing the user to specify their interested shots as seeds.The [46].We employ clique expansion to convert the hypergraph final clusters containing seeds are thus the retrieval results into the spatio-temporal graph.By clique expansion,each hy- Although a detailed description of random walks theory is peredge is expanded into a clique,in which the weight on each beyond the scope of this paper,it essentially works as follows. pair of nodes is taken as the weight on the hyperedge.On the First,we partition the node set V into seeded nodes Vs and spatial temporal shot graph,edge weight W is the sum of edge unseeded nodes Vu,satisfying that the value of each seed in Vs weights derived from those cliques to which the edge belongs. exceeds an entropy threshold.722 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 12, NO. 7, NOVEMBER 2010 where is a control parameter set to 0.1. For computational efficiency, we select three frames, namely, the first, middle, and last, from and separately and calculate the visual similarity according to their color histogram and edge histogram [40] distances (9) The use of edge histogram weakens the influence of lighting difference across multi-view shots. here is a weight that is empirically set to 0.5. For specific domains, could be modified to accommodate more complex texture information, as well as motion features. Semantic correlation of video shots is often related to the occurrences of specific events. Besides, it varies with different video genres. For instance, for surveillance videos, there is a definite correlation between two shots if the same human face is detected in both shots. However, for football game videos, there exists a strong correlation among the shots that record all the goals in a match. Since a comprehensive study of semantic correlation is beyond the scope of this paper, in our current implementation, we allow the user to interactively specify semantically correlated video shots. Semantic correlation value of two correlated shots and is set to 1. To measure the total similarity of shots and , a straightforward way is to fuse together the above similarity values with certain weights and to construct the spatio-temporal graph directly. Such a scheme, however, may destroy the original relationship once the fusion weights could not be set properly. For instance, two shots with large temporal distance visually resemble each other, imaging a people who repeats his actions at a 24-h interval in the same scene. A strong correlation between the two shots should exist. Nevertheless, improper weights will make too small and negligible. A natural way to remedy the flaw occurring above is to represent the correlations among multi-view shots as a hypergraph. A hypergraph is a graph in which an edge can connect more than two nodes. This edge is named as a hyperedge [44] which often links a subset of nodes. Obviously, in this sense, an ordinary graph is a special kind of hypergraph. In our hypergraph, the nodes just represent video shots. To construct the hyperedges, we build for each relationship, i.e., temporal adjacency, visual similarity as well as semantic correlation, an ordinary graph and apply graph clustering algorithm to the nodes. All the nodes in a cluster are then linked by a hyperedge in the hypergraph. Note that two cluster may overlap as for each relationship the clustering algorithm is performed. The weight on the hyperedge is the average of relation values of all pairs of nodes in the same cluster. Generally, there are two methods to transform a hypergraph into a general graph. One is directly using the hypergraph partition algorithm such as normalized hypergraph cut [45]. The other seeks solution through clique expansion or star expansion [46]. We employ clique expansion to convert the hypergraph into the spatio-temporal graph. By clique expansion, each hyperedge is expanded into a clique, in which the weight on each pair of nodes is taken as the weight on the hyperedge. On the spatial temporal shot graph, edge weight is the sum of edge weights derived from those cliques to which the edge belongs. In addition, to further simplify the graph, the edge weight is set to zero if it is smaller than a predefined threshold. V. MULTI-VIEW SUMMARIZATION The spatio-temporal shot graph is a suitable representation of multi-view video structure, since it carries shot information and meanwhile reflects intuitively correlations among shots. Due to the correlations among multi-view shots, the shot graph has complicated connectivity. This makes the summarization task challenging. We must generate the most representative graph nodes (shots) by taking into consideration the connections as well as users’ requirements. Our basic observation is that, with the shot graph, the multi-view video summarization can be formulated as a graph labeling problem. We accomplish this in two steps. We first cluster those event-centered similar shots, and pick out the candidates for summarization by random walks. Final summary is generated by a multi-objective optimization process that is specifically devised for accommodating different user requirements. A. Shot Clustering by Random Walks To cluster similar shots, we first sample a small number of important shots and then cluster the shots of the same events by random walks. We adopt random walks in this step rather than other graph partition algorithms such as graph cut because of the following reasons. On one hand, random walks hs proven to be effective in handling large and complex graphs, even in the presence of conspicuous noises. It is thus suitable for our clustering task which needs to partition the spatio-temporal shot graph with complicated node connections. Graph cut, however, is prone to the small cut and noise influence [47]. On the other hand, our graph partition is a -way segmentation problem given sampled shots indicating seeds for candidate clusters. Random walks algorithm works well for such problem. The random walker starts from each unsampled node (shot) and determines for it the most preferable sampled shot’s cluster. The final clusters thus obtained are actually event-centered. In general, many events can be represented as object activities and interactions, showing different motion patterns [48]. For the event captured by multi-view shots, similarities among shots in terms of visual, temporal, as well as semantic correlations should be large. In addition, each event may have at least a central shot which has a high shot importance. We can take it as one of the best views recording this event. The random walks-based shot clustering fulfills these requirements in that we select the shots with higher importance as seeded nodes. Such shots just can be viewed as the centers of events. Furthermore, the weight on graph is defined in form of shots’ similarities which makes clustering event relevant shots possible. Notice that the property of our event-centered clustering also facilitates video retrieval by allowing the user to specify their interested shots as seeds. The final clusters containing seeds are thus the retrieval results. Although a detailed description of random walks theory is beyond the scope of this paper, it essentially works as follows. First, we partition the node set into seeded nodes and unseeded nodes , satisfying that the value of each seed in exceeds an entropy threshold