726 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 12, NO. 7, NOVEMBER 2010

is integrated into the constraint that controls the total length of the summary. λ2 is used to adjust information coverage. Increasing λ1 and λ2 simultaneously generates a summary that is both long and informative.

The multi-view badminton videos are summarized into three levels, according to the length and information entropy set for the summary. The parameter λ1 is set to 0.035, 0.075, and 0.15 on the 1st, 2nd, and 3rd levels, respectively; λ2 is set to 0.6, 0.65, and 0.7 accordingly.
Obviously, the high-level summary covers most of the low-level summary, while a reasonable disparity is due to the different optimization procedures involved. The low-level summary comprises the most highly repeated actions, such as serve, smash, and dead bird. Such statistics can be used for badminton training. The high-level summary, in contrast, adds more exciting rallies, e.g., shots 67, 79, 124, 135, and 154 on level 3.

Other examples of multi-level summarization include the office lobby and road videos. We summarize both of them into two levels by setting λ2 to 0.6 and 0.7, respectively. In general, videos containing many events with different shot importance values are more suitable for multi-level summarization. For such videos, the low-level summary contains shots that suffice to describe most of the original video events. The high-level compact summary, by contrast, comprises the events that are more active or salient.

We now discuss the choice of λ1 and λ2. Intuitively, λ2 controls the importance value of the summary. In our method, shot importance is evaluated by the entropy defined in terms of low-level features and updated by high-level semantics. The total entropy of the shots discarded for their low activity is too small to be taken into account. Therefore, we can relatively safely assume that the reserved shots contain most of the information of the multi-view videos. λ2 can thus be regarded as the minimum percentage of information to be preserved in the summary. In our implementation, λ2 is given by the user. For λ1, we try values from 0.05 to 1 with an increment of 0.05 and select the one that ensures a solution for (15).

The computational complexity of our method mainly depends on the lengths, resolutions, and activities of the multi-view videos. The major cost is spent on video parsing and graph construction, which take about 15 min for the office1 example.
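The λ1 sweep described above can be sketched as follows. Since Eq. (15) is not reproduced in this section, the feasibility test below (a greedy check that a λ2 fraction of the total entropy fits within a λ1 fraction of the total length) is only a hypothetical stand-in for the actual constrained problem:

```python
def pick_lambda1(shots, lam2, total_len, total_entropy):
    """Try lambda1 from 0.05 to 1 in steps of 0.05 and return the first
    value for which the summary constraints are satisfiable.

    `shots` is a list of (length_sec, entropy) pairs.  The greedy
    feasibility check below is an illustrative stand-in for solving
    Eq. (15), whose exact form is not given here.
    """
    # Rank shots by entropy density so the length budget is filled greedily.
    ranked = sorted(shots, key=lambda s: s[1] / s[0], reverse=True)
    for k in range(1, 21):                 # lambda1 = 0.05, 0.10, ..., 1.0
        lam1 = k / 20.0
        budget = lam1 * total_len          # summary length budget
        used = covered = 0.0
        for length, entropy in ranked:
            if used + length <= budget:
                used += length
                covered += entropy
        if covered >= lam2 * total_entropy:
            return lam1                    # smallest feasible lambda1
    return None                            # no feasible setting found
```

For instance, with four 10-second shots of equal entropy, requiring half of the total entropy (λ2 = 0.5) first becomes feasible at λ1 = 0.5 under this toy check.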
In contrast, summarization with random walks-based clustering and multi-objective optimization is fast. This step takes less than 1 min, since the constructed graph has only about 60 nodes. Video summarization is often used as a post-processing tool. Our method can be further accelerated on a high-performance computing system.

A. Comparison With Mono-View Summarization

We compare our method with previous mono-view video summarization methods. The summaries produced by our method and previous ones are shown on the demo webpage. We implement the video summarization method presented in [11] and apply it to each view of the multi-view office1, campus, and office lobby videos. For each multi-view video, we combine the resulting shots along the timeline to form a single video summary. For a fair comparison, we also use the above method to summarize the single video formed by combining the multi-view videos along the timeline, and generate a dynamic single-video summary. As the summary is extracted around the crests of the attention curve, the method provides no mechanism for removing content redundancy among the multiple views. The summaries produced by the method clearly contain much redundant information: there are significant temporal overlaps among the summarized multi-view shots, and most events are recorded simultaneously in the summaries.

In contrast, our multi-view summarization method largely reduces such redundancy. Some events are recorded by the most informative summarized shots, while the most important events are preserved in the multi-view summaries. Some events that are ignored by the previous method, for instance the events recorded from the 1st to 5th second, the 14th to 18th second, and the 39th to 41st second of our office1 single-video summary, are preserved by our method. This behavior is determined by our shot clustering algorithm and the multi-objective optimization operated on the spatio-temporal shot graph.
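The temporal-overlap redundancy discussed above can be made concrete with a simple measure. The interval representation and the overlap ratio here are our own illustration, not a metric from this work:

```python
def overlap_seconds(a, b):
    """Length of the overlap between two [start, end) intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def redundancy_ratio(view_summaries):
    """Fraction of the total summarized time that temporally overlaps
    shots kept from other views.  `view_summaries` maps a view id to a
    list of (start, end) shot intervals on the common timeline."""
    total = sum(end - start
                for shots in view_summaries.values()
                for start, end in shots)
    if total == 0:
        return 0.0
    overlapped = 0.0
    for view, shots in view_summaries.items():
        for shot in shots:
            covered = sum(overlap_seconds(shot, other)
                          for other_view, other_shots in view_summaries.items()
                          if other_view != view
                          for other in other_shots)
            # A shot cannot be overlapped for more than its own length.
            overlapped += min(covered, shot[1] - shot[0])
    return overlapped / total
```

Two views that keep the same 10-second interval score 1.0 under this measure; fully disjoint summaries score 0.0.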
This property of our method facilitates generating a short yet highly informative summary.

We also compare our algorithm against a graph-based summarization method. A single video is first formed by combining the multi-view videos along the timeline. We then construct the graph according to the method given in [10]. The final summary is produced by using normalized cut-based event clustering and highlight detection [14]. Normalized cut, widely employed by previous methods, often suffers from the “small cut” problem. This can be problematic when the method uses a heuristic criterion to select highlights from event clusters as the summary; that is, some important events with short durations are missed. Our method, however, can meet different summarization objectives through the multi-objective optimization. Important events with much higher importance values are preserved across the multiple views, while important events with short durations are retained as well.

To quantitatively compare our method with previous ones, we use precision and recall to measure performance. We invited five graduate students who were unfamiliar with our research to define the ground-truth video summaries. A shot is labeled as a ground-truth shot only if all five subjects agree on it. For the office1 multi-view videos, a total of 26 shots are labeled as ground-truth shots. The ground-truth summary of the campus videos includes 29 shots. Precision and recall scores of the methods are shown in Table II. Accurately controlling the summary lengths is difficult. The summaries of the different methods are all around 50 s, except the campus summary obtained by the graph-based method [10], [14], which is 109 s. The second/sixth rows report the results of applying the method of [11] to each view separately. The third/seventh rows are generated by applying it to the single video formed by first combining the views.
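The shot-level precision and recall used in this comparison can be computed as follows (the shot ids in the example are hypothetical; Table II reports the actual scores):

```python
def precision_recall(summary_shots, ground_truth_shots):
    """Shot-level precision and recall of a summary against the
    ground-truth shot set agreed on by all annotators."""
    summary = set(summary_shots)
    truth = set(ground_truth_shots)
    hits = len(summary & truth)            # shots both kept and labeled
    precision = hits / len(summary) if summary else 0.0
    recall = hits / len(truth) if truth else 0.0
    return precision, recall
```

A summary that keeps shots {1, 2, 3, 4} against ground truth {1, 2, 5, 6} scores 0.5 on both measures.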
For the office1 multi-view videos, the precision scores show that the summaries obtained by every method largely belong to the ground truth. In contrast, the precisions of the four methods computed on the campus videos are all around 50%. The campus videos contain many trivial events, which makes it challenging for any of the methods to generate an unbiased summary. The last column of the table indicates that our method is superior to the others in terms of recall. This suggests that our method is more effective in removing content redundancy.

B. User Study

To further evaluate the effectiveness of our method, we have carried out a user study. The aim is to assess the enjoyability