
Course teaching resources for "Artificial Intelligence, Machine Learning and Big Data" (reference paper): A Large-Scale Short Video Dataset for Near Duplicate Video Retrieval


Figure 4. Example of hard negative videos.
All the candidates are visually similar to the query but not near-duplicates.

query video, we select the top-5 to top-10 similar videos as candidate videos for human annotation.

Figure 4 illustrates some examples of query videos and the corresponding negative candidate videos, where the candidate videos are mined based on deep features. In the example in the top row, a man is casting a net into the water. In the example in the middle row, a girl is having her hair done in a barbershop. In the example in the bottom row, a girl is playing in a room decorated with lights. However, as the persons in each video pair are different, none of these video pairs are near-duplicates, although they are very similar.

3.3. Probable Negative Unlabeled Set

We first select a subset of 700,000 videos from the ambient set as candidates for probable negative unlabeled videos, which are defined as negative videos without human annotation. After extracting a variety of frame and video features, we calculate the pairwise similarity between the query videos and the candidate videos. Candidate videos that have a high probability of being near-duplicates of the query videos are filtered out. The remaining candidate videos are then selected as probable negative unlabeled videos. Specifically, we utilize BSIFT features and aggregated deep features to calculate the similarity between query videos and candidate videos. The BSIFT features are used to calculate the Jaccard similarity, and only those videos whose similarities to all queries are 0 can be selected as candidate videos. Then the aggregated deep features are used to calculate video-level similarity based on Euclidean distance, and we further filter about 5% of the videos, those which have the smallest similarities to all queries. In the end, we obtain 526,787 videos for the probable negative unlabeled set.
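The two-stage filtering described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: each video is assumed to be represented by a set of binary codes (standing in for BSIFT features) and one aggregated deep feature vector, and the function names and the `drop_fraction` parameter are hypothetical. The text does not fully specify which 5% of the videos are removed in the second stage; the sketch assumes the candidates closest to any query in Euclidean distance are the ones dropped.

```python
import math


def jaccard(a, b):
    """Jaccard similarity between two sets of feature codes."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def filter_probable_negatives(queries, candidates, drop_fraction=0.05):
    """Return the ids of candidates kept as probable negatives.

    queries, candidates: dict mapping video id -> (codes, deep_feature).
    queries must be non-empty.
    """
    # Stage 1: keep only candidates whose Jaccard similarity to every query is 0.
    stage1 = {
        cid: feat
        for cid, (codes, feat) in candidates.items()
        if all(jaccard(codes, q_codes) == 0.0 for q_codes, _ in queries.values())
    }
    # Stage 2 (assumption): rank survivors by distance to their nearest query
    # and drop the closest `drop_fraction` of them.
    nearest = {
        cid: min(euclidean(feat, q_feat) for _, q_feat in queries.values())
        for cid, feat in stage1.items()
    }
    n_drop = int(len(nearest) * drop_fraction)
    ranked = sorted(nearest, key=nearest.get)
    return sorted(ranked[n_drop:])
```

At the scale described in the text (hundreds of thousands of candidates), a brute-force loop like this would be replaced by vectorized or indexed nearest-neighbor search; the sketch only shows the order of the two filters.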
To verify that the videos obtained by the above procedure are truly probable negatives, we randomly sample 100 videos from the probable negative unlabeled set and invite human annotators to label them against each of the query videos. None of these videos is labeled as a near-duplicate of the queries. Therefore, the videos in the probable negative unlabeled set are, with high probability, not near-duplicates of the query videos.

4. Transformations

In real applications, users might prefer to copy popular videos to gain attention. At the same time, these users usually choose to modify their copied videos slightly to bypass detection. These modifications include video cropping, border insertion and so on.

To mimic such user behavior, we define one temporal transformation, i.e., video speeding, and three spatial transformations, i.e., video cropping, black border insertion, and video rotation. Specifically, the video speeding transformation contains speeding up and slowing down. This type of transformation is designed to simulate video acceleration or deceleration. In real applications, users might crop the videos to zoom in or out of the original videos, which can be performed by frame cropping. Furthermore, users might insert borders, like black borders, to fit different video sizes. In addition, there exist many mobile-phone videos which are taken horizontally or vertically. When users upload these videos, they might rotate their videos.

These transformations are widely applied in the video re-creation procedure. By performing these transformations, harder candidates can be generated and we can construct more challenging datasets. Please note that the above transformations are used as illustrating examples, and users can define their own transformations based on their needs.

5. Experiments

We perform experiments to study the retrieval performance on the SVD dataset and other NDVR datasets.
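For reference, the four transformations defined in Section 4 can be sketched on a toy video representation. This is an illustrative sketch only: a video is assumed to be a list of frames, each frame a list of pixel rows, and all function names and parameters are hypothetical rather than taken from the paper. A real pipeline would more likely rely on a video processing tool.

```python
def change_speed(frames, factor):
    """Temporal transformation: factor > 1 speeds up, factor < 1 slows down."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    if not frames:
        return []
    n_out = max(1, round(len(frames) / factor))
    # Resample by index; indices are clamped to the last available frame.
    return [frames[min(int(i * factor), len(frames) - 1)] for i in range(n_out)]


def crop(frame, top, left, height, width):
    """Spatial transformation: keep a height x width window of the frame."""
    return [row[left:left + width] for row in frame[top:top + height]]


def add_black_border(frame, pad):
    """Spatial transformation: surround the frame with `pad` black (0) pixels."""
    width = len(frame[0]) + 2 * pad
    body = [[0] * pad + list(row) + [0] * pad for row in frame]
    top = [[0] * width for _ in range(pad)]
    bottom = [[0] * width for _ in range(pad)]
    return top + body + bottom


def rotate_90_clockwise(frame):
    """Spatial transformation: rotate the frame by 90 degrees clockwise."""
    return [list(col) for col in zip(*frame[::-1])]


def transform_video(frames, frame_fn):
    """Apply a per-frame spatial transformation to every frame of a video."""
    return [frame_fn(frame) for frame in frames]
```

For example, `transform_video(video, rotate_90_clockwise)` rotates every frame, while `change_speed(video, 2.0)` keeps roughly every second frame. Transformations can be chained to produce harder variants of the same source video.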
We adopt two categories of NDVR methods, i.e., real-value based NDVR methods and hashing based NDVR methods. In real applications, real-value based NDVR methods usually suffer from high storage cost and low query speed. To avoid high storage cost and enable fast query speed, hashing based methods [3, 31, 34, 29, 11, 27, 6] have also been adopted for NDVR.

5.1. Datasets

As TRECVID and MUSCLE_VCD are too small and the original videos in the background distraction set are not available for UQ_VIDEO, we select CCWEB [32] and VCDB [9] for comparison with SVD. We adopt the four transformations defined in Section 4 to construct more challenging variants of SVD. Specifi-
