In the equation above, kr is a damping parameter. The combined promotion function we apply on a tag pair (u, c) is the following:

    promotion(u, c) := rank(u, c) · stability(u) · descriptive(c)    (9)

When applying the promotion function in combination with either the voting or summing aggregation function, the score function is updated as presented below for the voting case:

    score(c) := Σ_{u ∈ U} vote(u, c) · promotion(u, c)    (10)

The tag recommendation system now contains a set of parameters (m, kr, ks, kd) which have to be configured. We use a training set, as described in the next section, to derive the proper configuration of these parameters. Furthermore, we will evaluate the performance of the promotion function with respect to the two aggregation strategies in Section 6, i.e., we evaluate the four different strategies as presented in Table 2.

                   vote     sum
    no promotion   vote     sum
    promotion      vote+    sum+

Table 2: The four tag recommendation strategies explored in this paper.
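To make the aggregation concrete, the following is a minimal Python sketch of the promotion-weighted voting scheme of Equations 9 and 10. The helpers rank_weight, stability, and descriptive stand in for the damped components defined earlier in the paper (not reproduced here), and candidates_for is an assumed lookup returning the top-m co-occurring tags for a user-defined tag; all names are illustrative.

    from collections import defaultdict

    def make_promotion(rank_weight, stability, descriptive):
        # Combined promotion function (Eq. 9): the product of the three
        # damped components; the component functions are assumed as given.
        return lambda u, c: rank_weight(u, c) * stability(u) * descriptive(c)

    def recommend(user_tags, candidates_for, promotion, top_n=10):
        # Promotion-weighted voting (Eq. 10): each user-defined tag u casts a
        # unit vote for every candidate c it co-occurs with, scaled by
        # promotion(u, c); candidates are ranked by decreasing score.
        scores = defaultdict(float)
        for u in user_tags:
            for c in candidates_for(u):        # top-m co-occurring tags of u
                if c not in user_tags:         # skip tags the photo already has
                    scores[c] += 1.0 * promotion(u, c)
        return sorted(scores, key=scores.get, reverse=True)[:top_n]

In the summing variant (sum+), the unit vote would be replaced by the aggregation score used by the sum strategy, as defined earlier in the paper, with the promotion factor applied in the same way.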
5. EXPERIMENTAL SETUP

In the following experiment we compare the four different tag recommendation strategies through an empirical evaluation. In this section we define the experimental setup and briefly present the system optimisation results, while the evaluation results are presented in Section 6.

5.1 Task

We have defined the following task: given a Flickr photo and a set of user-defined tags, the system has to recommend tags that are good descriptors of the photo. In our evaluation we set this up as a ranking problem, i.e., the system retrieves a list of tags, ranked by decreasing likelihood of being a good descriptor for the photo.
In an operational setting, such a system is expected to present the recommended tags to the user, such that she can extend the annotation by selecting the relevant tags from the list.

5.2 Photo Collection

For the evaluation we have selected 331 photos through the Flickr API. The selected photos are based on a series of high-level topics, for example “basketball”, “Iceland”, and “sailing”, that were chosen by the assessors to ensure that they possessed the necessary expertise to judge the relevancy of the recommended tags in context of the photo.

In addition, we ensured that the photos were evenly distributed over the different tag classes as defined in Table 1 of Section 3, to have variation in the exhaustiveness of the annotations. Apart from these two manipulations, the photo selection process was randomised.

Finally, we have divided the photo pool into a training set and a test set. For training we used 131 photos, and the test set consists of 200 photos.

              m    ks   kd   kr   MRR     P@5
    sum       10   -    -    -    .7779   .5252
    vote      10   -    -    -    .6824   .4626
    sum+      25   0    12   3    .7920   .5405
    vote+     25   9    11   4    .7995   .5527

Table 3: Optimal parameter settings and system performance for our tag recommendation strategies.

5.3 Assessments

The ground truth is manually created through a blind review pooling method, where for each of the 331 photos the top 10 recommendations from each of the four strategies were taken to construct the pool. The assessors were then asked to assess the descriptiveness of each of the recommended tags in context of the photo. To help them in their task, the assessors were presented the photo, title, tags, owner name, and the description. They could access and view the photo directly on Flickr to find additional context when needed.

The assessors were asked to judge the descriptiveness on a four-point scale: very good, good, not good, and don’t know. The distinction between very good and good was introduced to make the assessment task conceptually easier for the user. For the evaluation of the results, however, we use a binary judgement and map both scales to good. In some cases, we expected that the assessor would not be able to make a good judgement, simply because there is not enough contextual information, or because the expertise of the assessor is not sufficient to make a motivated choice. For this purpose, we added the option don’t know.

The assessment pool contains 972 very good judgements and 984 good judgements. In 2811 cases the judgement was not good, and in 289 cases it was undecided (don’t know).

5.4 Evaluation Metrics

For the evaluation of the task, we adopted three metrics that capture different aspects of the performance:

Mean Reciprocal Rank (MRR): MRR measures where in the ranking the first relevant (i.e., descriptive) tag is returned by the system, averaged over all the photos. This measure provides insight into the ability of the system to return a relevant tag at the top of the ranking.

Success at rank k (S@k): We report the success at rank k for two values of k: S@1 and S@5. The success at rank k is defined as the probability of finding a good descriptive tag among the top k recommended tags.

Precision at rank k (P@k): We report the precision at rank 5 (P@5). Precision at rank k is defined as the proportion of retrieved tags that is relevant, averaged over all photos.
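All three metrics can be computed directly from the per-photo rankings and the binarised assessments. Below is a small sketch, assuming rankings maps each photo to its ranked list of recommended tags and good_tags maps each photo to the set of tags judged good or very good; both names are illustrative, and the P@5 computation assumes at least five tags are recommended per photo.

    def evaluate(rankings, good_tags):
        # rankings:  {photo_id: ranked list of recommended tags}
        # good_tags: {photo_id: set of tags judged 'good' or 'very good'}
        n = len(rankings)
        mrr = s1 = s5 = p5 = 0.0
        for photo, ranked in rankings.items():
            rel = good_tags.get(photo, set())
            # MRR: reciprocal rank of the first descriptive tag (0 if none returned)
            mrr += next((1.0 / (i + 1) for i, t in enumerate(ranked) if t in rel), 0.0)
            # S@k: at least one descriptive tag among the top k recommendations
            s1 += float(any(t in rel for t in ranked[:1]))
            s5 += float(any(t in rel for t in ranked[:5]))
            # P@5: fraction of the top 5 recommendations that is descriptive
            p5 += sum(1 for t in ranked[:5] if t in rel) / 5.0
        return {'MRR': mrr / n, 'S@1': s1 / n, 'S@5': s5 / n, 'P@5': p5 / n}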
5.5 System Tuning

We used the training set of 131 photos to tune the parameters of our system. Recall from the previous section that our baseline strategies have one parameter m, and our promotion strategies have three additional parameters ks, kd, and kr. We tuned our four strategies by performing a parameter sweep and maximising system performance both in terms of MRR and P@5. Table 3 shows the optimal parameter settings and system performance for the four tag recommendation strategies.
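The sweep itself can be as simple as an exhaustive grid search on the training photos. The following is a sketch under stated assumptions: strategy is a hypothetical callable that ranks tags for a photo given the four parameters, evaluate() is the metric sketch above, and the grid ranges passed in are illustrative rather than the ones used in the paper; since the trade-off between MRR and P@5 is not specified here, the sketch simply maximises MRR.

    import itertools

    def tune(strategy, train_photos, good_tags, grids):
        # Exhaustive sweep over (m, ks, kd, kr), keeping the setting with the
        # best training MRR; `grids` maps each parameter name to a list of values.
        best_params, best_mrr = None, -1.0
        for m, ks, kd, kr in itertools.product(grids['m'], grids['ks'], grids['kd'], grids['kr']):
            rankings = {p: strategy(p, m=m, ks=ks, kd=kd, kr=kr) for p in train_photos}
            mrr = evaluate(rankings, good_tags)['MRR']
            if mrr > best_mrr:
                best_params, best_mrr = {'m': m, 'ks': ks, 'kd': kd, 'kr': kr}, mrr
        return best_params, best_mrr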