cameras, while most short videos are generated by amateurs with mobile devices. Hence, short videos might contain some new types of near-duplicates, e.g., horizontal/vertical screen videos and camera-shaking videos. Secondly, as the cost of editing a short video is lower, users might prefer to edit short videos. Hence, the number of near-duplicate short videos is larger than that of near-duplicate long videos. Therefore, there is an urgent need for a large-scale short video dataset for the NDVR task.

In this paper, we introduce a new large-scale short video dataset, called SVD, to foster research on NDVR for short videos. The main contributions of this paper are listed as follows:

• The introduced SVD dataset contains over 500,000 short videos and over 30,000 labeled videos for the NDVR task. To the best of our knowledge, SVD is the first large-scale short video dataset for the NDVR task. Compared with existing NDVR datasets, the SVD dataset is the largest one.

• With hard labeled positive/negative videos mined by multiple strategies, the SVD dataset is challenging for NDVR. Furthermore, we design some temporal and spatial transformations to mimic user behavior in real applications and construct more difficult and challenging variants of SVD.

• We perform two categories of retrieval to evaluate the performance of existing state-of-the-art NDVR methods on the SVD dataset, i.e., real-value based retrieval and hashing based retrieval (a minimal sketch of both paradigms is given after this list). Experiments demonstrate that these NDVR methods cannot achieve satisfactory retrieval performance on the SVD dataset. Hence, the release of the SVD dataset will foster research in the NDVR area.
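The following Python sketch makes the two evaluation categories concrete. It is only a minimal illustration, assuming each video has already been aggregated into a single real-valued feature vector or a binary hash code; the array names (db_feats, db_codes, etc.) are illustrative placeholders and are not part of SVD or of any specific NDVR method.

    import numpy as np

    def real_valued_retrieval(query_feat, db_feats, top_k=10):
        # Rank database videos by cosine similarity between real-valued features.
        q = query_feat / np.linalg.norm(query_feat)
        db = db_feats / np.linalg.norm(db_feats, axis=1, keepdims=True)
        sims = db @ q                     # one similarity score per database video
        return np.argsort(-sims)[:top_k]  # indices of the top-k candidates

    def hashing_retrieval(query_code, db_codes, top_k=10):
        # Rank database videos by Hamming distance between binary codes.
        dists = np.count_nonzero(db_codes != query_code, axis=1)
        return np.argsort(dists)[:top_k]

    # Hypothetical usage: 1,000 database videos with 512-d features / 64-bit codes.
    db_feats = np.random.randn(1000, 512)
    db_codes = np.random.randint(0, 2, size=(1000, 64), dtype=np.uint8)
    print(real_valued_retrieval(db_feats[0], db_feats))
    print(hashing_retrieval(db_codes[0], db_codes))

Real-value based retrieval ranks candidates by similarity between real-valued features, whereas hashing based retrieval ranks them by Hamming distance between compact binary codes, trading some accuracy for retrieval speed and memory.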
The rest of this paper is organized as follows. In Section 2, we briefly review the related work. In Section 3, we describe the dataset collection strategies in detail. In Section 4, we introduce some temporal and spatial transformations applied to the SVD dataset. In Section 5, we carry out experiments on the SVD dataset. At last, we conclude our paper in Section 6.

2. Related Work

We briefly review the datasets for the NDVR task in this section. Specifically, related datasets include the CCWEB [32], UQ_VIDEO [29], VCDB [9], MUSCLE_VCD [14], and TRECVID [22] datasets.

The CCWEB [32] dataset contains 24 query videos and 12,790 labeled videos. The authors utilize 24 text queries, e.g., "The lion sleeps tonight" and "Evolution of dance", to retrieve videos from YouTube, Google Video, and Yahoo! Video. The returned videos contain 27% redundant videos. Then the authors collect 12,790 videos as the labeled set. The average duration for this dataset is 151.02 seconds. In this dataset, over half of the queries are about dancing and singing, so the dataset lacks diversity.

UQ_VIDEO [29] is an extended dataset of CCWEB. The authors utilize the 24 query videos and 12,790 labeled videos of CCWEB as the query set and labeled set for the UQ_VIDEO dataset, respectively. Then the authors construct a background distraction set with 119,833 videos. The videos in the background distraction set are usually treated as negative, but the labels are not verified by humans. In the end, the authors collect 132,647 videos in total. Although UQ_VIDEO is larger than CCWEB, it also lacks diversity due to the limited number of queries. Furthermore, for all background distraction videos, this dataset only provides HSV [26] features and LBP [7] features of all key frames, and the original videos are not publicly available.
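For readers unfamiliar with these frame-level descriptors, the sketch below shows one way an HSV color histogram and an LBP histogram could be computed for a single key frame with OpenCV and scikit-image. It only illustrates the general feature types; the bin settings and the helper name key_frame_features are our own choices and do not reproduce the exact features released with UQ_VIDEO.

    import cv2
    import numpy as np
    from skimage.feature import local_binary_pattern

    def key_frame_features(frame_bgr):
        # HSV color histogram with 8 bins per channel, L1-normalized.
        hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
        hsv_hist = cv2.calcHist([hsv], [0, 1, 2], None, [8, 8, 8],
                                [0, 180, 0, 256, 0, 256]).flatten()
        hsv_hist /= hsv_hist.sum() + 1e-8

        # Uniform LBP histogram (P=8 neighbors, radius 1) on the grayscale frame.
        gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
        lbp = local_binary_pattern(gray, P=8, R=1, method="uniform")
        lbp_hist, _ = np.histogram(lbp, bins=np.arange(11), density=True)

        return np.concatenate([hsv_hist, lbp_hist])

    # Hypothetical usage on a dummy 240x320 key frame.
    print(key_frame_features(np.zeros((240, 320, 3), dtype=np.uint8)).shape)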
The VCDB [9] dataset utilizes the same 528 videos to construct both the query set and the labeled set. Furthermore, the authors provide 100,000 background distraction videos. Thus this dataset contains 100,528 videos in total. The VCDB dataset is originally proposed for the copyright detection task, and only provides 9,236 copied segment labels. However, for the NDVR task, we need video-level pairwise labels to denote whether a candidate video is a near-duplicate of the query video or not. Hence, we filter redundant copied segment pairwise labels and obtain 6,139 video-level pairwise labels for the NDVR task. Please note that all 6,139 video-level pairwise labels are positive. The average duration of the VCDB dataset is 72.77 seconds.
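The filtering step described above can be viewed as collapsing segment-level copy annotations into unordered video-level positive pairs. The sketch below illustrates this idea under a simplified annotation format; the tuple layout and the helper name segments_to_video_pairs are hypothetical and do not correspond to the actual VCDB annotation files or tooling.

    def segments_to_video_pairs(segment_labels):
        # segment_labels: iterable of (video_a, video_b, ...) tuples, one tuple
        # per copied segment; extra fields (e.g., timestamps) are ignored here.
        pairs = set()
        for video_a, video_b, *_ in segment_labels:
            if video_a != video_b:  # ignore segments copied within the same video
                pairs.add(tuple(sorted((video_a, video_b))))
        return pairs

    # Hypothetical example: three segment annotations collapse into two
    # video-level positive pairs.
    segments = [("v1", "v2", 0, 10), ("v2", "v1", 30, 45), ("v3", "v7", 5, 12)]
    print(len(segments_to_video_pairs(segments)))  # -> 2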
MUSCLE_VCD [14] collects 18 videos to construct its query set. Then the authors utilize the query videos to generate 101 videos as the labeled set based on some predefined transformations. Thus the MUSCLE_VCD dataset contains 119 videos in total.

The TRECVID [22] dataset utilizes 11,256 query videos to construct its query set. Then the authors use the query videos to generate 11,503 videos as the labeled set based on some predefined transformations. Thus the TRECVID dataset contains 22,759 videos in total.

The above datasets have been widely used for the NDVR task. All of them are long video datasets, and each has its own shortcomings. Specifically, the videos of the TRECVID and UQ_VIDEO datasets are not publicly available. The MUSCLE_VCD and TRECVID datasets are small-scale, and the labeled videos of these two datasets are generated by the authors of the datasets rather than by users of real video platforms. The CCWEB and UQ_VIDEO datasets lack diversity. The VCDB dataset only contains positive pairwise labels. The second to the sixth columns of Table 1 list the statistics of the aforementioned datasets. From Table 1, we can find that all existing NDVR datasets are long video datasets with an average duration longer than 60 seconds.