provider will give to the receiver. The second part is the probability of the tag appearing in the provider. The third part takes into consideration how trustworthy the provider is. We use the total number of tags given to a URL to estimate the extent to which the provider can be trusted. URLs with more total tags annotated can be trusted more. Therefore, the formula is as follows:

Prop(u, x) = Σ_{w∈U} Sim(u, w) × Pt(w, x) × log Σ_{x'∈T} Ft(w, x'),   for u ∈ U, x ∈ T

After we have the propagated weight of each tag in a URL, we can now recommend tags for this URL. For a URL u, we rank the list of tags to be recommended according to these tags' propagated weights. That is, the tag x with the highest Prop(u, x) is ranked first in the list and is therefore the best candidate for recommendation, the tag with the second highest propagated weight is ranked second, and so on. Note that for a URL already having prior tag information, the recommended tags may include tags that are not in the original set of tags annotating the URL. These extra tags work as supplements to the possibly insufficient tag information of the URL.

4 Evaluation

In order to evaluate the proposed content-based tag recommendation method, we performed experiments for the extreme case, where the URLs requiring tag recommendation have no prior tag information.

4.1 The Dataset

Our goal for this step is to construct a ground truth set required for evaluating our method.

First of all, we need to build a set of URLs with tag information. The URL crawling process starts from the Popular Bookmarks on the Delicious homepage. For each user who has bookmarked one of the popular bookmarks, his/her bookmarks are added to the set of harvested URLs.
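The propagation and ranking procedure described at the start of this section can be sketched as follows. This is a minimal illustration, assuming dictionary-based lookups (`sim`, `pt`, `ft`) rather than the paper's actual data structures: `sim[(u, w)]` is the URL similarity, `pt[w][x]` the probability of tag x in provider w, and `ft[w][x]` the raw count of users assigning tag x to w, so summing `ft[w]` over tags gives the trust term.

```python
from collections import defaultdict
from math import log

def propagate_tags(u, urls, sim, pt, ft):
    """Sketch of Prop(u, x): for each provider URL w, weight each of its
    tags by similarity to u, the tag's probability in w, and the log of
    w's total tag count (trustworthiness). Returns tags ranked by score."""
    scores = defaultdict(float)
    for w in urls:
        if w == u:
            continue  # a URL does not propagate tags to itself
        total_tags_w = sum(ft[w].values())  # trust: total tags given to w
        for x in ft[w]:
            scores[x] += sim[(u, w)] * pt[w][x] * log(total_tags_w)
    # highest propagated weight first: the best recommendation candidate
    return sorted(scores, key=scores.get, reverse=True)
```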
For each new URL, we look for previously unseen users who have bookmarked this URL, and look into their bookmark lists for new URLs. Using this process, we crawled a total of 1,049,580 URLs, and collected all the tags given by any user. The user information is ignored for now, but may be valuable for personalized recommendations.

The annotation frequency of each tag is computed by tallying the number of users who have annotated any URL in the set with the tag. The top 100,000 tags in terms of annotation frequency are selected as the tag/term vocabulary set. We observed that tags with low annotation frequency are often misspelled words (e.g. wednsday) or concatenated words (e.g. onlinedesigntools), which are not the best candidates for tag recommendation. Consequently, any tag with an annotation frequency of 15 or less is removed from the dataset.

The number of tags associated with each URL varies widely. In this dataset, documents with a total of 50 tags have an average of 17.96 distinct tags with a standard deviation of 5.18 tags. To ensure that our system is able to recommend up to 10 distinct tags for each webpage based on the ground truth dataset, any URL annotated with fewer than 50 tags by all users was removed from the set. In other words, the threshold is set at 50 so that most documents have more than 10 distinct tags as the ground truth.

However, given that most users tend to specify just a limited number of tags, many URLs have insufficient tag information even with the above filtering in place. In order to reduce this problem, the association rule algorithm [Agrawal et al., 1993] has been adopted to expand the tag set of a URL. Experiments in [Heymann et al., 2008b] showed that association rules may improve recall of single-tag queries for URLs with little tag information. By applying all rules of length less than 3 and confidence 0.9, an average of 8.16 tags were added to each URL.
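The tag-set expansion step can be illustrated with a small sketch. The rule mining itself (Agrawal et al., 1993) is omitted; `rules` is a hypothetical precomputed map from a single antecedent tag to the consequent tags of rules with confidence at least 0.9 (rules of length less than 3 have exactly one antecedent and one consequent):

```python
def expand_tags(tag_set, rules):
    """Expand a URL's tag set by applying single-antecedent association
    rules: whenever a tag in the set implies another tag with sufficient
    confidence, the implied tag is added. The original tags are kept."""
    expanded = set(tag_set)
    for t in tag_set:
        expanded.update(rules.get(t, ()))  # add all consequents of t
    return expanded
```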
One may wonder whether this expansion is reasonable, as adding associated tags may be "corrupting" the ground truth set. We will discuss this problem in the next section.

In the next step, all remaining URLs are filtered according to their term information. We crawled the page contents of the remaining URLs, ignoring files that were dead links or were not plain text. As the ability to deal with image content is beyond the scope of this research, URLs with a large number of images but few terms will be ignored for now, even though it may be relatively easy for people to tag these pages. The number of terms on each page is counted after the HTML tags are removed. Any URL with fewer than 50 or more than 100,000 terms in the corresponding content is eliminated from the dataset.

As both the number of tags and the number of terms associated with each URL follow the power law, the majority of the crawled URLs are filtered out in the process. As a result, the filtering process netted 85,230 URLs from the initial set. Each URL in the final dataset has on average 1084.33 total tags, 99.97 distinct tags, 1043 total terms, and 364.19 distinct terms.

4.2 Prediction Accuracy by Cross Validation

The first set of experiments looks into the effectiveness of our system in propagating accurate tags to the untagged URLs. In our 5-fold cross validation process, we removed one-fifth of the URLs' tags, and treated the remaining information (tag information for 4 folds and all the term information) as input into our system. After the propagation phase, the URLs that had their tags removed would be repopulated with tags, and we compared the repopulated tags with the ground truth for precision. Since users usually do not have the patience to annotate many tags for a URL, we only calculated up to precision at 10. During the training process, the only parameter we had to select is the similarity threshold ε.
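The precision-at-10 measure used in this cross-validation can be computed as in the following sketch, where `recommended` is a ranked tag list and `ground_truth` is the held-out tag set for a URL (the names are illustrative, not the paper's):

```python
def precision_at_k(recommended, ground_truth, k=10):
    """Precision at k: the fraction of the top-k recommended tags that
    appear in the ground-truth tag set of the URL."""
    top_k = recommended[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for t in top_k if t in ground_truth)
    return hits / len(top_k)
```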
All pairs of URLs with similarity less than ε are removed and not used in propagation. In addition, we compared our results with the results from term PLSA, which calculates the similarity between two URLs by computing only the cosine similarity between their respective term vectors. Note that the dimensions of the term vectors in our method and term PLSA have already been reduced to 500 dimensions using PLSA. The propagation method used in term PLSA is the same as the one used in our method. The results of our algorithm
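The term-PLSA baseline's similarity and the threshold filtering can be sketched as follows, assuming each URL is represented by a dense PLSA topic vector (500-dimensional in the paper; the helper names are illustrative):

```python
from math import sqrt

def cosine(v1, v2):
    """Cosine similarity between two dense vectors, e.g. PLSA topic vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = sqrt(sum(a * a for a in v1))
    n2 = sqrt(sum(b * b for b in v2))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def similar_pairs(vectors, eps):
    """Keep only the URL pairs whose similarity reaches the threshold eps;
    pairs below the threshold are discarded before propagation."""
    keys = list(vectors)
    return {(a, b): cosine(vectors[a], vectors[b])
            for i, a in enumerate(keys) for b in keys[i + 1:]
            if cosine(vectors[a], vectors[b]) >= eps}
```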