User Modeling for Recommendation in Blogspace

Kangmiao Liu, Wei Chen, Jiajun Bu, Chun Chen, Lijun Zhang
College of Computer Science, Zhejiang University
Hangzhou 310027, P.R. China
{lkm, chenw, bjj, chenc, zljzju}@zju.edu.cn

Abstract

Weblogs (also known as blogs) have become a key tool not only for individuals to publish posts, but also for obtaining useful information on a daily basis. Compared with traditional Internet services, blogspace contains much more personalized information, which makes it an ideal place to provide personalized services such as recommendation. This paper proposes a novel scheme to model users’ interests for recommendation in blogspace. It separates users’ interests into long-term and short-term interests, and models them by integrating the preference categories reflected in individual blog posts over a period of time. An interest attenuation algorithm is further introduced to model the decline of users’ interests in a specific category. Experimental results show that the proposed scheme describes the evolution of users’ interests well.

1. Introduction

A weblog is a web page that serves as a publicly accessible personal journal for an individual. With their dramatic growth in recent years, blogs have become a prevailing type of media on the Internet. According to Chinalabs.com, in September 2005 there were sixteen million blogs in China and one hundred million around the world. Readership is large as well: around 30% of Internet users in America and France read blogs. Blogspace therefore holds a huge volume of personalized information and a large number of users, which makes it an ideal place to provide recommendation services. For example, suppose that a blogger wants to buy a notebook PC and writes about it in the blog. After analyzing the blog post and the blogger’s preferences, the blog system automatically pushes news, ads, professional bloggers and blog groups related to notebook PCs to the user. Such a service not only increases the number of blog visitors, but also creates considerable wealth for blog service providers. To provide recommendation services in blogspace, modeling users’ interests precisely is the first step.

Recommender systems are broadly researched and widely applied in e-commerce sites (e.g., www.amazon.com) to assist customers in their purchasing decisions. Products can be recommended based on the top overall sellers on a site, on the demographics of the consumer, or on an analysis of the consumer’s past buying behavior as a prediction for future offers. Other recommendations, such as “those who bought product X also bought product Y”, are also prevalent [1]. However, there are few mature recommender systems in blogspace. Some blog sites, such as Blog Sina (http://blog.sina.com.cn) in China, provide preliminary recommendation services. Nevertheless, most of them simply push news, scandals and the like to people without considering bloggers’ preferences. Some blog sites use Google AdSense (https://www.google.com/adsense/) to recommend advertisements. However, this does not take full advantage of blogs and is limited to recommending advertisements.

Providing recommendation services in blogspace is much more difficult than in traditional domains. Users’ interests are not easy to model and represent, because the information in blogspace is noisy even though it is abundant. Moreover, users’ interests change dynamically over time.

In this paper, we propose a novel scheme to represent the evolution of users’ interests in blogspace. It maintains a long-term interest descriptor to capture the user’s general interests, formed gradually over the long run, and a short-term interest descriptor to keep track of the user’s more recent, faster-changing interests, which change on a daily basis. Users’ long-term and short-term preferences are obtained by analyzing their articles. An interest attenuation algorithm is proposed to model the decline of users’ interests. Experimental results show that changes in users’ interests can be described well using the proposed scheme.

The rest of the paper is organized as follows. Related work is discussed in Section 2. Section 3 describes the scheme to model users for recommendation in detail. Section 4 presents experiments that evaluate the proposed scheme. Section 5 summarizes and concludes the paper.

2007 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology - Workshops
0-7695-3028-1/07 $25.00 © 2007 IEEE, DOI 10.1109/WI-IATW.2007.23
2. Related work

The popularity of blogs is accompanied by increasing interest from the research and industrial communities. Ongoing research mainly focuses on blog content analysis, user communities and blog search. Current content analysis work includes the decomposition of bloggers’ moods [2, 3], bloggers’ concerns [4, 5], classification of blogs [6], topic extraction [5] and so on. This work indicates that, compared with traditional media such as online news sources and public websites maintained by companies, blogs have two unique characteristics: (1) they are mainly maintained by individuals, so their contents are generally personal, and (2) the link structures between blogs generally form localized communities [6, 7].

A variety of techniques have been proposed for recommendation, including content-based, collaborative, knowledge-based and other techniques [8]. All of the known recommendation techniques have strengths and weaknesses. So far, most of the recommendation techniques used to learn user needs and interests focus on users’ information-seeking behavior. Modeling users in a dynamic environment like blogspace, however, is a different problem and involves more challenges [9].

To the best of our knowledge, very little published work exists on user modeling for recommendation in blogspace at present.

3. Modeling users for recommendation in blogspace

In this section, we present the basic ideas of user modeling for recommendation in blogspace.

3.1 General ideas

Users usually have long-term and short-term interests [9]. Long-term interests represent a user’s general preferences; they are formed gradually over the long run and are fairly stable after they converge. Consequently, long-term interests tend to be inert, and the time it takes to change them could be proportional to the time it takes to build them. Short-term interests, on the other hand, are very unstable by nature and change on a daily basis. For example, users may pay attention to sports during the Olympic Games and lose interest in sports afterwards.

Users write blogs on a daily basis, so blogs are an ideal place to obtain bloggers’ short-term interests by analyzing blog posts. It is difficult to learn a user’s fine-granularity interests with current techniques (e.g., whether somebody likes the football star Filippo Inzaghi). However, state-of-the-art text classification methods achieve high precision in obtaining a user’s coarse-granularity preferences (e.g., whether somebody cares about sports or the economy). In this paper, we use text classification methods to analyze bloggers’ interests at the level of individual posts, and generate a user’s short-term interests by combining the interests gained from the individual posts written in a short period of time. Because a user’s interest in some aspect may diminish or disappear for physiological reasons, an interest attenuation algorithm is introduced to model this phenomenon.

Long-term interests can be obtained by analyzing registration information, user groups, user feedback, and blog posts over a long period of time in blogspace. Moreover, some short-term interests may turn into long-term ones. Because this information was not available in our initial work, a blogger’s long-term interests are modeled based on short-term interests for simplicity in this paper.

3.2 Obtaining users’ interests from individual blog posts

A number of statistical classification and machine learning techniques have been applied to text classification. In this paper, a Support Vector Machine (SVM) is employed to classify blog posts. SVM is a powerful supervised learning algorithm developed by Vapnik [10]. It has been successfully applied to text classification and performs very well. All blog posts are preprocessed before classification, including stopword removal, dimensionality reduction and so on. Each blog post is associated with a timestamp.
To obtain users’ interests at time $t$, text classification methods are used to assign the blog posts written at $t$ to one or more predefined categories based on their content. Formally, let $m$ be the number of blog posts in somebody’s blogspace. $D = (d_1, d_2, \ldots, d_m)$ is the set of all blog posts, with corresponding timestamps $T = (t_1, t_2, \ldots, t_m)$. The predefined categories are $C = (c_1, c_2, \ldots, c_n)$. A probability $w_{ij} \ge 0$ is associated with each blog post $d_i$ being classified into category $c_j$. We can then represent the user’s interests obtained from $d_i$ at time $t_i$ as

$$U_i = (w_{i1}, w_{i2}, \ldots, w_{in}).$$

3.3 Gaining users’ preference

Users’ interests attenuate with the passage of time. We propose an interest attenuation algorithm to model this decline. It introduces an attenuation factor, defined as follows:

$$f_i = e^{-\frac{\log 2}{hl}\,(t_{cur} - t_i)},$$

where $t_{cur}$ represents the current time, $t_i$ is the timestamp of blog post $d_i$, and $hl$ denotes the half-life, that is, the amount of time it takes for a user’s interest to decline by half, so that $f_i = 1/2$ when $t_{cur} - t_i = hl$. We simply assume that the half-life is the same for all interest categories. Then the user’s current interests associated with $d_i$ are

$$U'_i = (w_{i1} f_i, w_{i2} f_i, \ldots, w_{in} f_i).$$

The user’s current interest in category $c_j$ is

$$u_{cur,j} = \sum_{i=1}^{cur} w_{ij} f_i,$$

that is, the accumulated interest in category $c_j$ over all posts. The user’s interests can then be modeled as

$$U_{cur} = (u_{cur,1}, u_{cur,2}, \ldots, u_{cur,n}).$$

Intuitively, long-term interests take more time to attenuate than short-term ones. That is, given the half-lives of long-term ($hf_{long}$) and short-term ($hf_{short}$) interests, $hf_{long} > hf_{short}$. We define $f^{long}_i$ and $f^{short}_i$ as the attenuation factors of long-term and short-term interests, respectively, in the following discussion.

To obtain users’ interests, it is not necessary to use all blog posts: doing so is time-consuming and also models users’ interests less precisely, especially short-term interests. We define two thresholds, $T_{th}$ and $N_{th}$. Only blog posts written less than $T_{th}$ ago are considered. Moreover, if the number of posts in that interval is greater than $N_{th}$, only $N_{th}$ randomly selected posts in the interval are taken into account. Given that $t$ is the oldest timestamp fulfilling the aforementioned conditions, the user’s current short-term interest in category $c_j$ is

$$u^{short}_{cur,j} = \sum_{i=t}^{cur} w_{ij} f^{short}_i,$$

and the user’s short-term interests can be modeled as

$$U^{short}_{cur} = (u^{short}_{cur,1}, u^{short}_{cur,2}, \ldots, u^{short}_{cur,n}).$$

Short-term interests mainly reflect users’ current preferences; they are not very stable and change quickly. However, there are usually some stable long-term interests within short-term preferences. For instance, bloggers who like sports keep paying attention to sports-related information for a long time. In this paper, long-term interests are generated from short-term preferences: after short-term preferences accumulate to a certain level, they turn into long-term interests.
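The attenuation factor and the short-term accumulation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `attenuation` and `short_term_interests` are hypothetical, timestamps are assumed to be measured in days, and "written less than $T_{th}$ ago" is read as an elapsed time of at most $T_{th}$.

```python
import math
import random


def attenuation(t_post, t_cur, half_life):
    """Attenuation factor f_i: 1.0 for a post written now, 0.5 after one half-life.

    All three arguments are assumed to use the same time unit (days here).
    """
    return math.exp(-math.log(2) * (t_cur - t_post) / half_life)


def short_term_interests(posts, t_cur, half_life, t_th, n_th, n_categories, rng=random):
    """Decayed sum of per-post category weights, one total per category.

    posts: list of (timestamp, [w_i1, ..., w_in]) pairs (hypothetical layout).
    Only posts at most t_th old are kept; if more than n_th remain,
    n_th of them are sampled at random, as described in the text.
    """
    recent = [(t, w) for t, w in posts if 0 <= t_cur - t <= t_th]
    if len(recent) > n_th:
        recent = rng.sample(recent, n_th)
    totals = [0.0] * n_categories
    for t, weights in recent:
        f = attenuation(t, t_cur, half_life)
        for j, w in enumerate(weights):
            totals[j] += w * f  # accumulate w_ij * f_i for category c_j
    return totals
```

With a half-life of 10 days, a post written exactly 10 days ago contributes half of its classifier weight to each category, while a post written today contributes its full weight; posts older than `t_th` are ignored entirely.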
Given a series of short-term interests obtained at timestamps $T_s = (t_{s_1}, t_{s_2}, \ldots, t_{s_k})$ and the corresponding interests $U^{short} = (u^{short}_{s_1}, u^{short}_{s_2}, \ldots, u^{short}_{s_k})$, the current long-term interest in category $c_j$ is

$$u^{long}_{cur,j} = \sum_{i=s_1}^{s_k} u^{short}_{i,j} f^{long}_i,$$

and the user’s long-term interests can be modeled as

$$U^{long}_{cur} = (u^{long}_{cur,1}, u^{long}_{cur,2}, \ldots, u^{long}_{cur,n}).$$

A threshold $K$ also limits the number $k$ of short-term interest timestamps.

4. Experimental evaluation

Since there is no standard blog dataset, in this experiment we collected two publicly available data sets from the Web, both in Chinese. We call them SogouC and HiBaidu in the following discussion. SogouC is provided by Sogou Labs (http://www.sogou.com/labs/); it includes 17,910 web pages labeled with nine categories (IT, Economy, Health, Education, Military, Travel, Sport, Culture, Recruitment), each with 1,990 documents. Because building a labeled blog data set is laborious, SogouC is used to train the classification algorithm in this paper. HiBaidu consists of blog posts archived per individual from Baidu Space (http://hi.baidu.com/), a well-known blog site in China.

SogouC is first divided into a training set and a test set; each category has 1,330 training documents and 660 test documents. All documents were preprocessed before training. HTML tags were removed, then ICTCLAS (http://www.nlp.org.cn/) was used for Chinese word segmentation and part-of-speech labeling. After that, all stopwords, such as prepositions, quantifiers, punctuation, and modal and auxiliary words, were removed. In addition, because ICTCLAS produces many meaningless terms (e.g., runs of more than 100 consecutive ’#’ characters, URLs), we simply filtered out terms longer than 30 bytes. Then we used information gain (IG) [11], one of the most effective methods, for dimensionality reduction.
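The paper does not spell out how information gain is computed. A common formulation, sketched below as an assumption rather than as the authors' exact procedure, treats each term as present or absent in a document and measures how much knowing that reduces the entropy of the category labels. The function names and the toy data layout (documents as sets of terms) are hypothetical.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of a label distribution given as counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels, term):
    """IG(term) = H(C) - [P(t) H(C | t) + P(not t) H(C | not t)].

    docs: list of sets of terms; labels: parallel list of category names.
    """
    with_term = [lab for doc, lab in zip(docs, labels) if term in doc]
    without_term = [lab for doc, lab in zip(docs, labels) if term not in doc]
    n = len(docs)
    gain = entropy(Counter(labels).values())
    for part in (with_term, without_term):
        if part:  # an empty side contributes nothing
            gain -= (len(part) / n) * entropy(Counter(part).values())
    return gain
```

A term that appears in every document of one category and in no other document gets the maximum gain, while a term that never occurs gets a gain of zero; dimensionality reduction then keeps only the terms whose gain reaches a chosen threshold.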
The information gain was computed for each word of the training set, and the words whose information gain was less than a predetermined threshold were removed.

We ran the SVM classification algorithm on SogouC using rainbow, a program that performs statistical text classification (http://www.cs.cmu.edu/mccallum/bow/rainbow/). Precision (Pr.), Recall (Re.) and F-measure (F1) were used to evaluate the classifier. Table 1 shows the results, with the category names abbreviated. It shows that SVM provides high performance in text classification.

Table 1. Precision, Recall and F1 Results (%)

Ca.   IT  Ec  He  Ed  Mi  Tr  Sp  Cu  Re
Pr.   86  90  88  90  92  85  99  80  79
Re.   84  85  87  80  95  89  97  77  91
F1    85  87  88  85  93  87  98  79  84

In the user modeling experiments, we define $hf_{short}$ = 10 days, $hf_{long}$ = 30 days, $T_{th}$ = 10 days, $N_{th}$ = 20 and $K$ = 3, chosen empirically. Posts from dozens of bloggers were collected; only blogspaces publishing more than 10 posts every month are considered. They are analyzed with the long-term and short-term interest model proposed in this paper. Experimental results show that user interests can be modeled well by the long-term and short-term interest model.

Figure 1 shows a blogger’s interest transformation in sports, which reflects a total of 370 blog posts from Nov. 5, 2006 to Apr. 8, 2007 collected at http://hi.baidu.com/guoqingliu/blog. His short-term interest in sports stays high between Jan. 14, 2007 and Feb. 11, 2007; we find that he wrote a series of articles about NBA stars at that time. In March 2007, he did not write any articles about sports, so his short-term interest in sports declined. From the figure, we can see that his long-term interest in sports increased gradually after Jan. 14, 2007 and decreased when the short-term interest went down. That is, a user’s short-term interests influence the long-term interests. Moreover, we find that short-term interests are very unstable while long-term interests are inert.

Figure 1. Example of somebody’s interest transformation in sports (series: long-term and short-term interests in sports)

Based on the users’ interest models, blog companies can easily provide recommendation services by pushing news, ads, blog groups and so on that match users’ preferences.

5. Conclusions and future work

Modeling users is beneficial for many personalized services in blogspace, such as recommender systems. In this paper, we propose a novel scheme to model users’ interests for recommendation in blogspace. Text classification methods are employed to obtain users’ interests from individual blog posts. Then we propose an interest attenuation algorithm to model the decline of users’ interests. Users’ interests obtained from individual posts are combined to form short-term interests, which further evolve into long-term interests.
Experimental results show that the proposed user modeling scheme describes changes in users’ interests well and can easily be integrated into recommendation services.

There are several interesting directions in which to extend our work. For example, the link structures between blogs may be utilized to find user groups and then model user group interests. Choosing the parameters used in our experiments is also difficult; for instance, the half-lives of different interest categories may differ. These issues will be researched in our future work.

References

[1] S.C. Cazella, E. Reategui, L.O.C. Alvares, Applying the User’s Opinion Relevance in Recommender Systems, Proc. of the 12th Symp. on Multimedia and the Web, pp. 71-78, 2006.
[2] K. Balog, M. de Rijke, Decomposing Bloggers’ Moods, 3rd Workshop on the Weblogging Ecosystem, WWW 2006.
[3] G. Mishne, Experiments with Mood Classification in Blog Posts, 1st Workshop on Stylistic Analysis of Text for Information Access, SIGIR 2005.
[4] T. Fukuhara, T. Murayama, T. Nishida, Analyzing concerns of people using Weblog articles and real world temporal data, 2nd Workshop on the Weblogging Ecosystem, WWW 2005.
[5] M. Thelwall, Bloggers during the London attacks: Top information sources and topics, 3rd Workshop on the Weblogging Ecosystem, WWW 2006.
[6] X.C. Ni, G.R. Xue, X. Ling, et al., Exploring in the Weblog Space by Detecting Informative and Affective Articles, Proc. of the 16th International Conference on World Wide Web, pp. 281-290, 2007.
[7] Q.Z. Mei, C. Liu, H. Su, C.X. Zhai, A Probabilistic Approach to Spatiotemporal Theme Pattern Mining on Weblogs, Proc. of the 15th International Conference on World Wide Web, pp. 533-542, 2006.
[8] R. Burke, Hybrid Recommender Systems: Survey and Experiments, User Modeling and User-Adapted Interaction, Vol. 12, No. 4, pp. 331-370, 2002.
[9] D.H. Widyantoro, T.R. Ioerger, J. Yen, Learning User Interest Dynamics with a Three-Descriptor Representation, J. of the American Society for Infor. Sci. and Tech., Vol. 52, No. 3, pp. 212-225, 2001.
[10] V. Vapnik, Principles of Risk Minimization for Learning Theory, Advances in Neural Information Processing Systems, Morgan Kaufmann, pp. 831-838, 1992.
[11] K. Aas, L. Eikvil, Text Categorisation: A Survey, Technical report, Norwegian Computing Center, 1999.