we observe that the optimal iteration number is 480. So we think the optimal iteration numbers for them are 340 and 480, respectively, and we use these two parameters in the above experiments.

[Figure 2. Dependence between the topic number and the prediction accuracy (RMSE) for the items' tag analysis; panels show 20, 50, and 90 percent of the data used as the training set, with the number of topics on the x-axis.]

[Figure 3. Dependence between the topic number and the prediction accuracy (RMSE) for the users' tag analysis; panels show 20, 50, and 90 percent of the data used as the training set, with the number of topics on the x-axis.]

Compared with the iteration number, we are more interested in the dependence between the topic number and the prediction accuracy. We record the effect on prediction accuracy for topic numbers ranging from 1 to 50. From Figure 2 we observe that the optimal topic number for the items' tag analysis is less than 25. This optimum is not necessarily the global one; still, although it is only a local optimum, the RMSE value near a topic number of 25 is stably below 0.9. The observation that some better parameter choices lie below 10 can be explained by data sparsity. Rating data are always sparse, so when the topic number is set relatively large, our algorithm may fail to find enough consultants for rating inference. One probable case is this: a neighbor found by our approach is very similar to the movie we are to predict, but the user has not given a rating to it.

For the user-tag analysis, the optimal topic number is 23, as Figure 3 illustrates. The situation here is similar to that in the item-tag analysis: the local minimum is not the global one, and the reason is the same as mentioned before. What is different is that the stable region here is shorter than in the item-tag analysis and the fluctuation is more obvious. This observation can be explained by the diversity of personal interests. Compared with movies, the attributes of human beings are more dynamic and diverse; it is easier to find similar items than similar people because the similarity measure in the latter situation is vaguer.

In summary, the optimal topic number is around 25 in both situations, which means the results are consistent. Moreover, the number of genres in common use is of the same order of magnitude, so from this perspective our results are reasonable. But we must emphasize that our method of topic finding and the common genre classification focus on different targets and thus produce different results. Considering that genre information demands expert knowledge, our method of topic finding has a wider range of application. A sketch of the topic-number sweep described above follows.
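To make the experiment concrete, the following is a minimal sketch of the sweep, assuming an LDA topic model trained with gensim's LdaModel; the rating-inference step itself is passed in as a callable (predict_ratings), since its neighbor-based details are described earlier in the paper and are not reproduced here.

import math
from gensim import corpora, models

def rmse(predictions, truths):
    # Root mean square error over aligned prediction/truth lists.
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predictions, truths))
                     / len(truths))

def sweep_topic_numbers(tag_docs, ratings_test, predict_ratings,
                        k_values=range(1, 51), iterations=340):
    # Train one topic model per candidate topic number and record the
    # RMSE of the rating predictions it induces.
    dictionary = corpora.Dictionary(tag_docs)   # tag_docs: list of token lists
    corpus = [dictionary.doc2bow(doc) for doc in tag_docs]
    truths = [rating for (_user, _item, rating) in ratings_test]
    rmse_by_k = {}
    for k in k_values:
        model = models.LdaModel(corpus, id2word=dictionary,
                                num_topics=k, iterations=iterations)
        # predict_ratings(model, ratings_test) must return one predicted
        # rating per test triple; it stands in for our inference step.
        rmse_by_k[k] = rmse(predict_ratings(model, ratings_test), truths)
    return rmse_by_k

The optimal topic number is then min(rmse_by_k, key=rmse_by_k.get), and plotting rmse_by_k against k reproduces curves of the kind shown in Figures 2 and 3.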
Discussion

There are some issues concerning implementation details that we need to explain here. Because tags are given with a high degree of freedom, there is a lot of preprocessing work to do. First, the data contain plenty of noise, such as ":D". We should remove it all, but in practice we cannot guarantee that all noise is cleared away, because it follows no rules. Second, the stoplist is one of the most important parts to handle. To the best of our knowledge, most stoplists used in document clustering remove words such as "good" and "great". In our approach, however, these words reflect the users' preferences for the items and are therefore meaningful; we hesitate to remove them because keeping them may benefit rating prediction. In our experiments, we remove prepositions, conjunctions, and other less meaningful words while leaving emotional words untouched. Third, stemming is a complicated technique that we must also employ. It may be simple for ordinary document and web-page retrieval, but the bags of tags in our experiment are rather chaotic, and we worry that the stemming algorithm may to some extent harm the quality of topic finding. A sketch of this three-step pipeline follows.
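As an illustration of these three steps, here is a minimal preprocessing sketch, assuming NLTK's English stopword list and Porter stemmer; the concrete noise filter, stoplist, and stemmer are illustrative choices rather than a record of our exact implementation.

import re
from nltk.corpus import stopwords   # requires nltk.download("stopwords")
from nltk.stem import PorterStemmer

# Emotional words we keep because they may carry preference information;
# the subtraction documents the intent for stoplists that contain them.
KEEP = {"good", "great"}
STOPLIST = set(stopwords.words("english")) - KEEP
STEMMER = PorterStemmer()

def preprocess_tags(raw_tags):
    # Clean one bag of tags: remove noise, apply the stoplist, then stem.
    cleaned = []
    for tag in raw_tags:
        tag = tag.lower().strip()
        # 1) Noise removal: keep only tokens of letters, spaces,
        #    apostrophes, or hyphens; emoticons such as ":D" are dropped.
        if not re.fullmatch(r"[a-z][a-z' -]*", tag):
            continue
        # 2) Stoplist: drop prepositions, conjunctions, and other filler,
        #    while keeping emotional words such as "good" and "great".
        if tag in STOPLIST:
            continue
        # 3) Stemming: applied last; on chaotic tag bags this is the
        #    riskiest step, as noted above.
        cleaned.append(STEMMER.stem(tag))
    return cleaned

For example, preprocess_tags(["Good", ":D", "of", "Movies"]) returns ["good", "movi"].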
If we take better care of these three factors, we should be able to further improve the quality of recommendation to some extent.

CONCLUSIONS AND FUTURE WORK