Specifically, we utilize SVD_transformation to denote a variant of the SVD dataset, where the labeled positive videos are replaced by the corresponding transformed videos. Here the transformation denotes one of the transformations defined in Section 4, i.e., transformation ∈ {Cropping, Black Border, Rotation, Speeding}. Please note that we adopt the acceleration transformation for SVD_Speeding. For all datasets, the groundtruth videos of a given query video are defined as the labeled positive videos.

5.2. Benchmark and Evaluation Protocol

5.2.1 Benchmark

For real-value based methods, we adopt four widely used real-valued NDVR methods, including three video-level methods, i.e., layer-wise convolutional neural network (CNNL) [12], vector-wise convolutional neural network (CNNV) [12] and deep metric learning (DML) [13], and one frame-level method, i.e., circulant temporal encoding (CTE) [24].

In real applications, real-value based methods might be impractical for massive videos. Hence, we also adopt some hashing methods for evaluation. Specifically, we adopt four hashing methods, including one data-independent method, i.e., locality sensitive hashing (LSH) [3], two unsupervised hashing methods, i.e., iterative quantization (ITQ) [5] and isotropic hashing (IsoH) [11], and one supervised hashing method, i.e., Hamming distance metric learning (HDML) [20]. In this paper, we use only four hashing methods for demonstration, although more sophisticated hashing methods can be adopted to further improve the performance [15].

For real-value based NDVR methods, following the setting of DML [13], we utilize VGG16-Net [28] pre-trained on ImageNet [25] to extract 4096D deep features for every frame. For all datasets, we set fps = 1 for fair comparison (CTE achieves higher accuracy with fps = 15 on the CCWEB dataset, but we set fps = 1 for fair comparison). After extracting deep features for each frame, we utilize the same normalization strategy as that in DML, i.e., zero-mean and L2-normalization, to generate video-level deep features. DML is a triplet-based deep metric learning method; for all datasets, we utilize the hard triplet sampling strategy proposed by [13]. For the CNNL and CNNV methods, we also utilize the 4096D deep features extracted by VGG16-Net pre-trained on ImageNet, and for all datasets we randomly sample 50,000 frames to learn 300 centers by the k-means algorithm. For the hashing based methods, we also use the 4096D deep features extracted by VGG16-Net to perform hashing learning for fair comparison. For all baselines except CNNL, CNNV and CTE, source code is kindly provided by their authors; we carefully implement CNNL, CNNV and CTE ourselves.
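As a concrete illustration of the frame-to-video feature pipeline described above, the following sketch aggregates per-frame VGG16 features into a single video-level descriptor. It is not the authors' implementation: the exact aggregation and normalization order in DML may differ, and temporal mean pooling, per-vector zero-mean, and the function name are our assumptions.

```python
import numpy as np

def video_level_feature(frame_features):
    """Aggregate per-frame 4096-D deep features into one video-level vector.

    frame_features: array of shape (num_frames, 4096), one row per frame
    (e.g., VGG16-Net features extracted at fps = 1).

    Assumption: temporal mean pooling, then per-vector zero-mean and
    L2-normalization; the paper only states "zero-mean and L2-normalization"
    without specifying the exact statistics.
    """
    video_feat = frame_features.mean(axis=0)          # temporal average over frames
    video_feat = video_feat - video_feat.mean()       # zero-mean (assumed per vector)
    norm = np.linalg.norm(video_feat)                 # L2 norm
    return video_feat / norm if norm > 0 else video_feat
```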
For real-value based NDVR methods, Euclidean distance is used to rank the retrieved data points. For hashing based NDVR methods, we learn a binary code for each video, and the Hamming distance is then used as the metric to rank the retrieved data points.

To further improve the retrieval accuracy of the hashing methods, we can utilize a reranking strategy. Specifically, we first use the Hamming distance to generate a ranked list for all returned videos. Then we select the top-N returned videos and run a reranking algorithm. During the reranking procedure, we calculate the Euclidean distance between the query video and each of the selected top-N videos based on the deep features, and obtain the final ranked list for the selected N videos based on this Euclidean distance.
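The two-stage ranking just described can be sketched as follows. This is a minimal illustration, not the authors' implementation: binary codes are assumed to be 0/1 NumPy arrays, the deep features are the video-level descriptors, and keeping the remaining videos in their Hamming order after the reranked top-N is our assumption (the text only specifies the order of the selected N videos).

```python
import numpy as np

def rerank_top_n(query_code, db_codes, query_feat, db_feats, top_n=100):
    """Rank by Hamming distance on binary codes, then rerank the top-N
    candidates by Euclidean distance on real-valued deep features.

    query_code: (B,) binary code of the query video (0/1 entries).
    db_codes:   (N, B) binary codes of the database videos.
    query_feat: (D,) video-level deep feature of the query video.
    db_feats:   (N, D) video-level deep features of the database videos.
    """
    # Stage 1: Hamming distance ranking over the whole database.
    hamming = (db_codes != query_code).sum(axis=1)
    first_stage = np.argsort(hamming)

    # Stage 2: rerank only the top-N candidates with Euclidean distance.
    candidates = first_stage[:top_n]
    euclidean = np.linalg.norm(db_feats[candidates] - query_feat, axis=1)
    reranked = candidates[np.argsort(euclidean)]

    # Final list: reranked top-N, followed by the rest in Hamming order (assumed).
    return np.concatenate([reranked, first_stage[top_n:]])
```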
5.2.2 Evaluation Protocol

For the CCWEB and VCDB datasets, following the setting of DML [13], we utilize the query set and the labeled set as the training set. During the testing procedure, for the CCWEB dataset we utilize the query set as the test set and the labeled set as the database; the retrieval procedure is then performed by using the test set to retrieve from the database. For the VCDB dataset, we select the query set as the test set, and we utilize the labeled set together with the background distraction set as the database. For the SVD dataset, we randomly select 1,000 query videos from the query set, together with their labeled videos, as the training set. During the testing procedure, we utilize the remaining 206 query videos from the query set as the test set, and the corresponding labeled set together with the whole probable-negative unlabeled set as the database.

We utilize mean average precision (MAP) and top-K MAP as evaluation metrics. Specifically, for each query video v_q, the average precision (AP) is calculated according to the following equation:

    AP(v_q) = \frac{1}{R_q} \sum_{k=1}^{M} P_q(k) \, 1_k,        (1)

where R_q is the number of labeled positive videos, M denotes the number of videos in the database, P_q(k) is the precision at cut-off k in the ranked list for video v_q, and 1_k is an indicator function which equals 1 if the k-th returned video is a groundtruth of the query video and 0 otherwise. Then, given n query videos, we can calculate MAP as follows:

    MAP = \frac{1}{n} \sum_{q=1}^{n} AP(v_q).

The top-K MAP can be calculated similarly by setting M = K in Equation (1). Furthermore, we also compare the storage cost and retrieval time of real-value based NDVR methods and hashing based NDVR methods.
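A direct implementation of Equation (1) and the MAP formula might look as follows. The input format (a binary relevance vector per query, ordered by rank) and the function names are our own; the computation itself follows the equations above, with the top-K variant obtained by truncating the ranked list to K entries.

```python
import numpy as np

def average_precision(relevance, num_positives, cutoff=None):
    """AP as in Equation (1): (1/R_q) * sum_k P_q(k) * 1_k over the ranked list.

    relevance: binary sequence, relevance[k] == 1 if the (k+1)-th returned
               video is a groundtruth near-duplicate of the query.
    num_positives: R_q, the number of labeled positive videos for the query.
    cutoff: if set to K, computes the top-K variant (M = K in Equation (1)).
    """
    rel = np.asarray(relevance, dtype=np.float64)
    if cutoff is not None:
        rel = rel[:cutoff]
    ranks = np.arange(1, len(rel) + 1)
    precision_at_k = np.cumsum(rel) / ranks            # P_q(k)
    return float((precision_at_k * rel).sum() / num_positives)

def mean_average_precision(all_relevance, all_num_positives, cutoff=None):
    """MAP: mean of AP over the n query videos."""
    aps = [average_precision(r, R, cutoff)
           for r, R in zip(all_relevance, all_num_positives)]
    return float(np.mean(aps))
```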