of different interest categories are the same. Then, the user's current interests associated with $d_i$ are

$U_i = (w_{i1} \cdot f_i,\; w_{i2} \cdot f_i,\; \ldots,\; w_{in} \cdot f_i)$.

The user's current interest in category $c_j$ is

$u_{cur,j} = \sum_{i=1}^{cur} w_{ij} \cdot f_i$,

that is, the accumulation of the user's interest in category $c_j$ over the posts. The user's current interests can then be modeled as

$U_{cur} = (u_{cur,1}, u_{cur,2}, \ldots, u_{cur,n})$.

Intuitively, long-term interests attenuate more slowly than short-term interests; that is, given the half-lives of long-term ($hf_{long}$) and short-term ($hf_{short}$) interests, $hf_{long} > hf_{short}$. Before going on, we define $f_i^{long}$ and $f_i^{short}$, which in the following discussion denote the attenuation factors of long-term and short-term interests, respectively.

To obtain a user's interests, it is not necessary to use all blog posts. Using all of them is not only time-consuming but also fails to model the user's interests precisely, especially short-term interests. We therefore define two thresholds, $T_{th}$ and $N_{th}$: only blog posts written within the last $T_{th}$ are considered, and if the number of posts in this interval exceeds $N_{th}$, only $N_{th}$ randomly selected posts in the interval are taken into account. Let $t$ be the oldest timestamp fulfilling the aforementioned conditions. The user's current short-term interest in category $c_j$ is

$u_{cur,j}^{short} = \sum_{i=t}^{cur} w_{ij} \cdot f_i^{short}$,

and the user's short-term interests can be modeled as

$U_{cur}^{short} = (u_{cur,1}^{short}, u_{cur,2}^{short}, \ldots, u_{cur,n}^{short})$.
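For concreteness, the short-term computation can be sketched as follows. The excerpt specifies the half-lives but not the exact form of the attenuation factor, so an exponential decay $f_i^{short} = 2^{-(now - t_i)/hf_{short}}$ is assumed here; the `Post` data structure and function names are likewise illustrative, not from the paper.

```python
import random
from dataclasses import dataclass

@dataclass
class Post:
    timestamp: float        # posting time, in days since a fixed epoch
    weights: list[float]    # w_ij: per-category weights from the classifier

def short_term_interests(posts: list[Post], now: float,
                         hf_short: float = 10.0,
                         t_th: float = 10.0, n_th: int = 20) -> list[float]:
    """Compute U^short_cur = (u^short_cur,1, ..., u^short_cur,n)."""
    # Only posts written within the last T_th days are considered.
    recent = [p for p in posts if now - p.timestamp <= t_th]
    # If more than N_th posts fall in the interval, sample N_th at random.
    if len(recent) > n_th:
        recent = random.sample(recent, n_th)
    if not recent:
        return []
    u = [0.0] * len(recent[0].weights)
    for p in recent:
        # Assumed attenuation: f_i^short = 2^(-(now - t_i)/hf_short),
        # so a post exactly hf_short days old contributes half its weight.
        f = 0.5 ** ((now - p.timestamp) / hf_short)
        for j, w in enumerate(p.weights):
            u[j] += w * f   # accumulate w_ij * f_i^short per category c_j
    return u
```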
Short-term interests mainly reflect a user's current preferences; they are not very stable and change quickly. However, some stable long-term interests usually underlie short-term preferences. For instance, bloggers who like sports keep paying attention to sports-related information for a long time. In this paper, long-term interests are generated from short-term preferences: once short-term preferences accumulate to a certain level, they turn into long-term interests.

Given a series of short-term interests obtained at timestamps $T_s = (t_{s_1}, t_{s_2}, \ldots, t_{s_k})$ with corresponding interest vectors $U^{short} = (U_{s_1}^{short}, U_{s_2}^{short}, \ldots, U_{s_k}^{short})$, the current long-term interest in category $c_j$ is

$u_{cur,j}^{long} = \sum_{i=s_1}^{s_k} u_{s_i,j}^{short} \cdot f_i^{long}$,

and the user's long-term interests can be modeled as

$U_{cur}^{long} = (u_{cur,1}^{long}, u_{cur,2}^{long}, \ldots, u_{cur,n}^{long})$.

There is also a threshold $K$ that limits $k$, the number of short-term interest timestamps.
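A matching sketch for the long-term model follows, under the same assumed exponential decay, now with half-life $hf_{long}$. Reading $K$ as "keep the most recent $K$ snapshots" is our interpretation; the excerpt only says that $K$ limits $k$.

```python
def long_term_interests(snapshots: list[tuple[float, list[float]]],
                        now: float, hf_long: float = 30.0,
                        k_max: int | None = None) -> list[float]:
    """Compute U^long_cur from short-term snapshots (t_si, U^short_si)."""
    # Threshold K caps k, the number of retained timestamps; we assume
    # the most recent K snapshots are kept and older ones dropped.
    if k_max is not None:
        snapshots = snapshots[-k_max:]
    if not snapshots:
        return []
    u = [0.0] * len(snapshots[0][1])
    for t_si, u_short in snapshots:
        f = 0.5 ** ((now - t_si) / hf_long)   # assumed f_i^long, as above
        for j, v in enumerate(u_short):
            u[j] += v * f                     # u^short_si,j * f_i^long
    return u
```

With the half-lives used in the experiments below, a snapshot taken 30 days ago would contribute half the weight of a fresh one to the long-term vector.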
4. Experimental evaluation

Since there is no standard data set for blogs, we collected two publicly available data sets from the Web for this experiment, both in Chinese; we call them SogouC and HiBaidu in the following discussion. SogouC is provided by Sogou Labs (http://www.sogou.com/labs/); it includes 17,910 web pages labeled with nine categories (IT, Economy, Health, Education, Military, Travel, Sport, Culture, Recruitment), each containing 1,990 documents. Because building a labeled blog data set is laborious, SogouC is used to train the classification algorithm in this paper. HiBaidu consists of blog posts archived per individual from Baidu Space (http://hi.baidu.com/), a famous blog site in China.

SogouC is first divided into a training set and a test set; each category has 1,330 training documents and 660 test documents. All documents were preprocessed before training. HTML tags were removed, and ICTCLAS (http://www.nlp.org.cn/) was used for Chinese word segmentation and part-of-speech labeling. After that, all stopwords, such as prepositions, quantifiers, punctuation, and modal and auxiliary words, were removed. Furthermore, since ICTCLAS produces many meaningless terms (e.g., runs of more than 100 consecutive '#' characters, URLs), we simply filtered out terms longer than 30 bytes. Then we used information gain (IG) [11], one of the most effective feature selection methods, for dimensionality reduction: the information gain was computed for each word in the training set, and words whose information gain fell below a predetermined threshold were removed.
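As a rough illustration of this feature selection step, the sketch below computes $IG(t) = H(C) - [P(t)H(C|t) + P(\bar{t})H(C|\bar{t})]$ from binary term presence and keeps terms above a threshold. This is one standard formulation; the exact IG variant and threshold value used in the paper are not specified in the excerpt.

```python
import math
from collections import Counter, defaultdict

def entropy(counts: list[int]) -> float:
    """Entropy H of a class distribution given raw counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def information_gain(docs: list[tuple[set[str], str]]) -> dict[str, float]:
    """Per-term IG(t) = H(C) - [P(t)H(C|t) + P(!t)H(C|!t)].

    `docs` is a list of (terms, category label) pairs, with each
    document reduced to its set of distinct terms.
    """
    n = len(docs)
    class_counts = Counter(label for _, label in docs)
    h_c = entropy(list(class_counts.values()))
    term_class = defaultdict(Counter)   # class counts of docs containing t
    df = Counter()                      # document frequency of t
    for terms, label in docs:
        for t in terms:
            term_class[t][label] += 1
            df[t] += 1
    ig = {}
    for t, present in term_class.items():
        absent = class_counts - present      # class counts where t is absent
        p_t = df[t] / n
        ig[t] = h_c - (p_t * entropy(list(present.values()))
                       + (1.0 - p_t) * entropy(list(absent.values())))
    return ig

def select_terms(docs: list[tuple[set[str], str]], threshold: float) -> set[str]:
    """Keep only terms whose information gain meets the threshold."""
    return {t for t, g in information_gain(docs).items() if g >= threshold}
```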
We ran the SVM classification algorithm on SogouC using rainbow, a program that performs statistical text classification (http://www.cs.cmu.edu/mccallum/bow/rainbow/). Precision (Pr.), recall (Re.), and F-measure (F1) were used to evaluate the classifier. Table 1 shows the results, with category names abbreviated; it shows that SVM delivers high text classification performance.

Table 1. Precision, Recall and F1 results (%)

Ca.  IT  Ec  He  Ed  Mi  Tr  Sp  Cu  Re
Pr.  86  90  88  90  92  85  99  80  79
Re.  84  85  87  80  95  89  97  77  91
F1   85  87  88  85  93  87  98  79  84

In the user modeling experiments, we define $hf_{short} = 10$ days, $hf_{long} = 30$ days, $T_{th} = 10$ days, $N_{th} = 20$ and