Information Sciences 181 (2011) 1552–1572
Contents lists available at ScienceDirect
Information Sciences
journal homepage: www.elsevier.com/locate/ins

Personalized recommendation of popular blog articles for mobile applications

Duen-Ren Liu *, Pei-Yun Tsai, Po-Huan Chiu
Institute of Information Management, National Chiao Tung University, Hsinchu, Taiwan

* Corresponding author. Tel.: +886 3 5131245; fax: +886 3 5723792. E-mail addresses: dliu@mail.nctu.edu.tw, dliu@iim.nctu.edu.tw (D.-R. Liu).

Article info

Article history: Received 5 October 2009; Received in revised form 22 October 2010; Accepted 1 January 2011; Available online 9 January 2011

Keywords: Mobile service; Blog recommenders; Time-sensitive topic; Collaborative filtering

Abstract

Weblogs have emerged as a new communication and publication medium on the Internet for diffusing the latest useful information. Providing value-added mobile services, such as blog articles, is increasingly important to attract mobile users to mobile commerce, in order to benefit from the proliferation and convenience of using mobile devices to receive information anytime and anywhere. However, there are a tremendous number of blog articles, and mobile users generally have difficulty in browsing weblogs owing to the limitations of mobile devices. Accordingly, providing mobile users with blog articles that suit their particular interests is an important issue. Very little research, however, has focused on this issue.

In this work, we propose a novel Customized Content Service on a mobile device (m-CCS) to filter and push blog articles to mobile users. The m-CCS includes a novel forecasting approach to predict the latest popular blog topics based on the trend of time-sensitive popularity of weblogs. Mobile users may, however, have different interests regarding the latest popular blog topics. Thus, the m-CCS further analyzes the mobile users' browsing logs to determine their interests, which are then combined with the latest popular blog topics to derive their preferred blog topics and articles. A novel hybrid approach is proposed to recommend blog articles by integrating personalized popularity of topic clusters, item-based collaborative filtering (CF) and attention degree (click times) of blog articles. The experimental results demonstrate that the m-CCS system can effectively recommend mobile users' desired blog articles with respect to both popularity and personal interests.

© 2011 Elsevier Inc. All rights reserved.
0020-0255/$ - see front matter © 2011 Elsevier Inc. All rights reserved. doi:10.1016/j.ins.2011.01.005

1. Introduction

Weblogs have emerged as a new communication and publication medium on the Internet for diffusing the latest useful information. Blog articles represent the opinions of the populace and constitute a reaction to current events (e.g., news) on the Internet [13]. Accordingly, looking for the latest popular issues discussed by blogs and attracting readers' attention is an interesting subject. Moreover, providing value-added mobile services, such as blog articles, is increasingly important to attract mobile users to mobile commerce, in order to benefit from the proliferation and convenience of using mobile devices to receive information anytime and anywhere. There are, however, a tremendous number of blog articles, and mobile users generally have difficulty in browsing weblogs owing to the inherent limitations of mobile devices, such as small screens, short usage time and poor input mechanisms. Accordingly, providing mobile users with blog articles that suit their interests is an important issue. Very little research, however, has focused on this issue.

There are three main types of research regarding blogs. The first type focuses on analyzing the link structure between blogs to form a community [19,20]. Through the hyperlinks between blogs, people can communicate across blogs by publishing content related to other blogs. Nakajima et al. [31] proposed a method to identify the important bloggers in conversations, based on their roles in preceding blog threads, and to identify "hot" conversations. The second type focuses on content analysis to derive the propagation of topics and trends in the blogosphere. Gruhl et al. [11,12] modeled the information propagation of topics among blogs based on blog text. By tracking topic and user drift, Hayes et al. [13] examined the relationship between blogs over time. Mei et al. [28] proposed a method to discover the distributions and evolution patterns of topics across time and space. Although existing studies have investigated the evolution of blog topics, they have not considered how to predict the degree of popularity of blog topics. The last type focuses on how to model bloggers and derive their interests in order to generate personal recommendations [38,40]. A variety of methods has been proposed to model bloggers' interests and provide recommended content which is similar to their earlier experiences [15,24].

The majority of previous studies on blogs have ignored the hot topics and popular articles discussed by mass groups of readers, who engage in browsing actions related to the blog articles. Moreover, existing studies do not consider recommending blog articles to mobile readers in mobile environments. With more and more blog articles continually being published on the Internet, the scale and complexity of blog contents are growing rapidly, resulting in information overload for blog readers. Mobile readers can browse only a very limited number of blog articles because of the restrictions of mobile devices.
Accordingly, traditional recommendation methods, such as the collaborative filtering approach [1,2,5,17,25,35], may suffer from the sparsity problem of finding similar users or items, due to insufficient historical records of mobile readers browsing blog articles. To address the sparsity issue and blog information overload, it is essential to design an appropriate mechanism for recommending blog articles in mobile environments.

Blog readers are often interested in browsing emerging and popular blog topics, and the popularity of blogs can be inferred from the accumulated click times on blogs. Popularity based solely on click times, however, cannot truly reflect popularity trends. For example, a new event may trigger emerging discussions such that the number of related blog articles and browsing actions is small at the beginning and rapidly increases as time goes on. Thus, it is important to analyze the trend of time-sensitive popularity of blogs to predict emerging hot blog topics. In addition, blog readers may have different interests regarding the emerging popular blog topics. Nevertheless, existing studies have not addressed how to predict the popularity trend of blog topics and personalized popular topics.

More specifically, several studies have proposed ways to model bloggers' interests and provide personal recommendations [15,24,38,40]. Traditional recommender system approaches can also be adopted to recommend blog articles to mobile users. However, existing studies have not addressed the issue of recommending personalized popular blog articles. This issue is especially important for mobile environments, where mobile users cannot freely browse a tremendous number of blog articles on the Internet due to the restrictions of mobile devices, and therefore must rely on service providers' recommendations to browse a small and feasible subset of blog articles.
Many blog articles are new to the system, since they have not been viewed by any mobile user in the system due to the limitations of mobile devices. Traditional recommendation methods may suffer from the new item problem, in which there is no rating record on new items from which to derive the prediction [1]. This means that most new articles, which are popular on the Internet and to which the masses of Internet users pay attention, may be ignored by conventional recommendation methods. Accordingly, the recommended feasible set of blog articles should contain articles which are new to the system but are popular with Internet users and also suit mobile users' personal interests. Existing recommendation approaches have neither addressed such issues nor considered the popularity degree of blog articles.

In this work, we propose a novel Customized Content Service on a mobile device (m-CCS) to recommend personalized and popular blog articles to mobile users. Conventional recommender systems mainly employ the users' behavior logs recorded in the systems to make recommendations. Differing from existing recommender systems, we use an additional data source collected from the Internet, i.e., the Internet users' click times on blog articles, to identify the popularity degree of blog articles, which is integrated with the recommendation approach to improve recommendation quality in mobile recommender services.

First, we propose a novel approach to predict the trend of time-sensitive popularity of blog topics. We analyze blog contents retrieved by co-RSS to derive topic clusters, i.e., blog topics. We define a topic as a set of significant terms that are clustered together based on aspects of similarity. By examining the clusters, we can extract the salient features of topics. Moreover, we analyze the click times of Internet readers accessing articles.
For each topic cluster, we modify a double exponential smoothing method [6,7] to predict the popularity degree of the topic according to the variation in trends of click times by Internet readers. Second, mobile users may have different interests regarding the latest popular blog topics. Thus, we further propose a novel approach to infer mobile users' preferred (personalized) popular blog topics based on the predicted popularity degree of blog topics and mobile users' personal interests, derived by analyzing their browsing logs. Third, a novel hybrid recommendation approach is proposed to recommend blog articles by integrating personalized popularity of topic clusters, item-based collaborative filtering (CF) and attention degree (click times) of blog articles.

The major novel ideas are as follows. The hybrid prediction is derived according to the clarity of the personal preference obtained from collaborative filtering, based on the historical behavior of the mobile user. With clear preference, i.e., more browsing records of the mobile user, the hybrid prediction is influenced more by the user preference prediction based on collaborative filtering. The hybrid prediction is, however, dominated by the Internet attention degree of articles for the mobile users who have very few
browsing records with which to infer their preferences. Moreover, the hybrid prediction considers the predictive personalized popularity degree of the topic cluster to which each article belongs; the more popular the topic of an article is, the more numerous the users who are interested in the article.

The filtered articles are sent to the individual's mobile device via a WAP Push service. This allows the user to receive personalized and relevant articles, satisfying the demand for instant information. Finally, we conduct on-line experiments to compare different strategies: unified push of articles selected by experts and personalized push of articles selected by the m-CCS system's novel recommendation service. The experimental results show that our proposed approach, which considers the customized predictive popularity degree, can increase the click rates of blog articles and thereby enhance the quality of recommendation. The proposed m-CCS system can effectively recommend desirable blog articles to mobile users based on popularity and personal interests.

The remainder of this paper is organized as follows. Section 2 introduces work related to blogs, forecasting and recommendations; a brief introduction to our system is given in Section 3; detailed descriptions of the processing modules of our system are presented in Sections 4 and 5; Section 6 illustrates how to integrate different modules of our system to develop recommendation methods; the system architecture is illustrated in Section 7; Section 8 presents the evaluation of the usefulness of m-CCS empirically and practically; and the conclusions and suggestions for future work are presented in Section 9.

2. Literature review

2.1. Discovering the trend of blog topics

Blog content represents the opinions of the populace and reactions to current events (e.g., news) on the Internet [13]. With Web 2.0, blogs have become such a powerful force that mainstream media cannot help but take notice [9].
Several studies focus on analyzing blog content to derive the propagation of topics and trends in the blogosphere. Gruhl et al. [11,12] modeled the information propagation of blog topics, based on blog texts. The patterns they proposed for topic propagation were useful for predicting sales forecasts. In addition, blog content has recently attracted growing research attention. Blog text analysis focuses on eliciting useful information from blog entry collections, and determining certain trends in the blogosphere. A Natural Language Processing (NLP) algorithm has been used to determine the most important keywords within a definite time period; it can automatically discover trends across blogs [9]. Nevertheless, the above-mentioned studies emphasize assigning blog articles to only one topic, while blogs, in fact, contain many topics.

Mei et al. [28] focused on a mixture of subtopics and recognized the spatiotemporal topic patterns within blog documents. They proposed a probabilistic method to model the most salient topics from a text collection, and discover the distributions and evolution patterns across time and space. To track topic and user drift, Hayes et al. [13] examined the relationship between blogs over time. Some studies have investigated the evolution of blog topics. However, most studies have not considered how to predict the popularity degree of blog topics. In addition, existing studies mainly analyze the content of blog articles to discover the evolution and trend of blog topics without considering the Internet readers' perspective, i.e., the click times of Internet readers on blog articles. Differing from other studies, we identify blog topics by clustering similar blog articles into clusters (topics), and then use the accumulated Internet readers' click times on the blog articles in each topic cluster to predict the popularity degree of blog topics.

2.2. Recommending blog articles

Several studies investigated user modeling and personal recommendation in blog space. A variety of methods [38,40] has been proposed to model bloggers' interests, such as classifying articles into predefined categories to identify the author's preference [24], and thereby automatically recommend the blog articles which suit their interests, by analyzing the contents to which bloggers have reacted. Huang et al. [15] proposed an approach to extract terms relevant to users from blog articles, and then recommend blog articles explored by Google's search engine. While bloggers can receive recommended content which is similar to their earlier experiences, the method ignores the hot topics and popular articles discussed by the bulk of readers, which can attract mobile users' interest. These studies mainly examined the interests of bloggers and identified which topics were widely discussed by the bloggers without considering the perspectives of Internet readers. They did not address the issue of how to predict the popularity trend of blog topics. Moreover, existing approaches to recommending blog articles did not investigate the recommendation of popular blog articles by considering the popularity degree of blog topics. Differing from existing studies, we recommend personalized and popular blog articles by considering Internet readers' click times on blog articles and the predictive popularity degree of blog topics.

2.3. Forecasting

Forecasting methods mainly use historical data to infer future development trends. Time series prediction uses a set of observation values in time order to construct a suitable model to forecast future trends. Among the variety of methods, the exponential smoothing method [6] is easy to understand and highly reliable; this method can also make short-term predictions from relatively little data. The exponential smoothing method assumes stability and regularity in the trend of time series.
A standard exponential smoothing method [30] assigns exponentially decreasing weights to previous observations. In other words, recent observations are given relatively more weight in forecasting than older observations. The exponential smoothing method has been widely used in short-term or medium-term forecasting of economic development trends. In the simple exponential smoothing method, the current prediction value is derived from the prediction value and actual value of the preceding time period. Simple exponential smoothing is suitable for stationary time series which do not exhibit a trend effect.

The double exponential smoothing approach is usually used to process time series data with a trend effect, and the prediction is made using Eq. (1) [7]. For the preceding time series, $x(t)$ is the actual value at time $t$, $\hat{x}(t)$ is the prediction value at time $t$, and $b(t)$ represents the trend effect at time $t$. To forecast the value for time $t+1$, $\hat{x}(t+1)$ is the weighted average of two quantities, $x(t)$ and $[\hat{x}(t) + b(t)]$, weighted by $\alpha$, which is a smoothing constant. Therefore, the value of the smoothing constant determines which quantity has the greater influence on the prediction value. As the formula shows, each prediction value is a weighted combination of the series values within the past period. The more recent the historical data, the greater its weight in the prediction:

$$\hat{x}(t+1) = \alpha\, x(t) + (1-\alpha)\,[\hat{x}(t) + b(t)], \tag{1}$$

$$b(t) = \beta\,[\hat{x}(t) - \hat{x}(t-1)] + (1-\beta)\, b(t-1). \tag{2}$$

The trend effect at time $t$, $b(t)$, is calculated by Eq. (2). The value $\beta$ is used to weight the difference between two prediction values, $\hat{x}(t)$ and $\hat{x}(t-1)$, belonging to adjacent days, against the preceding trend effect, $b(t-1)$. For the double exponential smoothing method, the values of $\hat{x}(t)$ and $b(1)$ have to be assigned in the initial stage. The simplest way is to assume $\hat{x}(2) = x(1)$ and $b(1) = 0$.
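As an illustration, the recursion in Eqs. (1) and (2) can be sketched in Python. This is a minimal sketch rather than the authors’ implementation: the function name `double_exponential_smoothing` is hypothetical, and setting $\hat{x}(1) = x(1)$, so that the first trend difference is zero, is our assumption, since the text only fixes $\hat{x}(2) = x(1)$ and $b(1) = 0$.

```python
def double_exponential_smoothing(x, alpha, beta):
    """One-step-ahead forecasts following Eqs. (1) and (2).

    x     : list of actual values x(1), ..., x(T)
    alpha : smoothing constant for the level, between 0 and 1
    beta  : smoothing constant for the trend, between 0 and 1
    Returns (x_hat, b): dicts keyed by the 1-based time index.
    """
    if not x:
        raise ValueError("x must contain at least one observation")
    # Initial stage: x_hat(2) = x(1) and b(1) = 0 as in the text.
    # x_hat(1) = x(1) is an assumption made only for this sketch.
    x_hat = {1: x[0], 2: x[0]}
    b = {1: 0.0}
    for t in range(2, len(x) + 1):
        # Eq. (2): trend effect from the change in successive forecasts.
        b[t] = beta * (x_hat[t] - x_hat[t - 1]) + (1 - beta) * b[t - 1]
        # Eq. (1): x[t - 1] holds the actual value x(t) (0-based list).
        x_hat[t + 1] = alpha * x[t - 1] + (1 - alpha) * (x_hat[t] + b[t])
    return x_hat, b
```

With a larger `alpha` the forecast follows the latest actual value more closely, while a larger `beta` makes the trend term react faster to recent changes in the forecasts.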
Some research has also suggested that the selection of the initial values is not important for stationary series [7], since it does not have a significant effect on the prediction result. In this work, we modified a double exponential smoothing method to predict the popularity degree of a topic according to the variation in the trend of click times by Internet readers.

2.4. Recommendation approaches

Recommender systems are widely used to provide suitable personalized information to users according to their needs and preferences [1–3,17,18,22,29,35]. They have been applied in many different areas [36], such as products [8,23], movies [32], books [10] and music [37], and they not only offer a personalized recommendation service for each customer, but also benefit business marketing strategies. Generally, recommender systems are mainly based on content-based filtering and collaborative filtering.

The content-based filtering (CBF) approach analyzes customers’ preferences regarding the attribute features of items to build up a personal feature profile, and then predicts which items the customer will like [14,41]. In other words, this approach recommends items whose attribute features are similar to the customer profile derived from past preferences; it is most often used for document, webpage and news article recommendations. However, this method still has some restrictions: it is not easy to analyze the features of items, and users can only receive recommended items which are similar to past ones [21].

The collaborative filtering (CF) approach is one of the most popular recommendation approaches, and it has been successfully applied in many areas [4,32]. This method can solve some of the problems of the content-based method mentioned above. There is no need to analyze the contents of an item; the recommended items are identified for target users solely based on the similarities to the historical profiles of other users. Furthermore, it can deal with items whose content is dissimilar to those seen in the past. Based on the relationship between items or users, the CF method can be classified into two types [35]: user-based CF and item-based CF. User-based CF calculates the similarity between users, and predicts the target user’s preference regarding different items; GroupLens is an example of such a system [32]. The CF approach involves two steps: neighborhood formation and prediction. The neighborhood of a target user is selected according to his/her similarity to other users, which is computed by the Pearson correlation coefficient or the cosine measure. Either the k-NN (nearest neighbor) approach or a threshold-based approach is used to choose the k users who are most similar to the target user.

With the numbers of users and items exploding, determining how to quickly produce high-quality recommendations and search a large number of potential neighbors in real time are important issues, especially for commercial systems. The item-based CF method has been proposed to identify the relationships between different items that users have already rated and then rank recommended items that each user has not viewed before; this method has already been applied on the Amazon platform [10], achieving good performance.

The item-based collaborative filtering (ICF) algorithm [34] first analyzes the relationships between items (e.g., documents), rather than the relationships between users. Then, the item relationships are used to compute recommendations for users indirectly, by finding items that are similar to other items which the user has previously accessed. Thus, the prediction for item $j$ for user $u$ is calculated as the sum of the ratings given by the user for items similar to $j$, weighted by item similarity, as shown in Eq. (3):

$$p_{u,j} = \frac{\sum_{i=1}^{n} w(j,i)\, r_{u,i}}{\sum_{i=1}^{n} |w(j,i)|}. \tag{3}$$
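A minimal Python sketch of the weighted sum in Eq. (3) follows. It assumes that the similarities between the target item and the other items have already been computed by some similarity measure; the function name `predict_rating` and the dictionary layout are hypothetical and chosen only for illustration.

```python
def predict_rating(user_ratings, similarities):
    """Predicted rating of a target item j for one user, as in Eq. (3).

    user_ratings : dict mapping item id -> rating the user gave that item
    similarities : dict mapping item id -> similarity of that item to the
                   target item j (items missing from the dict count as 0)
    Returns None when no rated item has a non-zero similarity to j.
    """
    numerator = 0.0
    denominator = 0.0
    for item, rating in user_ratings.items():
        w = similarities.get(item, 0.0)
        numerator += w * rating    # similarity-weighted rating
        denominator += abs(w)      # normalise by the absolute similarities
    if denominator == 0.0:
        return None                # no usable neighbours for this item
    return numerator / denominator
```

Returning `None` for an item with no similar rated neighbours is a design choice of this sketch; it makes the missing-evidence case explicit instead of silently predicting zero.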
In Eq. (3), $p_{u,j}$ represents the predicted rating of item $j$ for user $u$; $w(j,i)$ is the similarity between two items $j$ and $i$; and $r_{u,i}$ denotes the rating of user $u$ for item $i$. A number of methods can be used to determine the similarity between items, e.g., cosine-based similarity, correlation-based similarity, and adjusted cosine similarity. Since the adjusted cosine similarity method performs better than the others [34], we used it as the similarity measure for the ICF method. The adjusted cosine similarity between two items $i$ and $j$ is given by Eq. (4):

$$\mathrm{AdjSim}(i,j) = \frac{\sum_{u \in U} (r_{u,i} - \bar{r}_u)(r_{u,j} - \bar{r}_u)}{\sqrt{\sum_{u \in U} (r_{u,i} - \bar{r}_u)^2}\;\sqrt{\sum_{u \in U} (r_{u,j} - \bar{r}_u)^2}}, \tag{4}$$

where $r_{u,i}$ and $r_{u,j}$ are the ratings of items $i$ and $j$ given by user $u$, and $\bar{r}_u$ is the average item rating of user $u$.

The CBF method is limited in being unable to provide serendipitous recommendations, since the recommendation is based solely on the content features of items that the user has preferred. The success of collaborative filtering relies on the availability of a sufficiently large set of quality preference ratings provided by users. Accordingly, finding users with similar preferences is difficult if the user rating matrix is very sparse (few preference ratings), causing the sparsity problem for the CF method. In addition, the CF method may suffer from the new item problem, in which there is no rating record on new items by which to derive the prediction [1].

3. System process overview

We propose a novel value-added mobile service, namely Customized Content Service on mobiles (m-CCS), to provide customized blog articles for mobile users based on the time-sensitive popularity of topics and personal preference patterns, as shown in Fig. 1.

The first step of our system is to collect blog articles from the Internet. The RSS mechanism is a useful way to capture the latest articles automatically without visiting each site. RSS is an abbreviation for Really Simple Syndication, an XML-based format for aggregating information from multiple web sources. Any mobile user can subscribe to RSS feeds. However, there may be a shortage of information if individuals subscribe to too few RSS feeds. Thus, we propose a co-RSS method to solve this problem. The co-RSS method gathers all RSS feeds from users such that RSS flocks, called crows-RSS, are formed to enrich information sources. After this preliminary procedure, the system can automatically collect desirable content from diverse resources. Moreover, we use information retrieval technology (e.g., the tf-idf approach) [33] to pre-process articles which are trawled every day from blog websites according to the crows-RSS feeds. After extracting the features (term vectors) of blog articles, the time-sensitive popularity tracking (TPT) module groups articles into topic clusters and automatically predicts their trend of popularity. The details of the TPT module are presented in Section 4.

Fig. 1. System overview for m-CCS.
Since the viewable content on mobile device screens is limited, designing a personalized service for filtering articles is particularly desirable. The m-CCS can monitor the click rates of articles daily and log user viewing records to infer the implicit preferences of mobile users. Without requiring explicit user ratings, the implicit interest of a user in an article is inferred by comparing the time spent reading the article with the average time spent on articles of the same size. The browsing records of users are analyzed to discover their behavior patterns, and their personal preferences are then deduced through a personal favorite analysis (PFA) module. Moreover, the m-CCS predicts a user’s preferred topics by deriving his/her customized popularity degree of topic clusters according to the predicted popularity of topic clusters and his/her preferences. Section 5 presents the details of the PFA module.

Finally, the system recommends blog articles based on the customized popularity degree of topic clusters and the preferences of mobile users. The recommended articles are then sent to the user’s mobile device via a WAP Push service. This allows users to instantly receive personalized and relevant blog articles. The proposed recommendation process of the m-CCS mainly integrates content analysis and collaborative filtering to address the shortcomings of pure collaborative filtering (CF), including sparsity and cold start issues, and combines three aspects: (1) the prediction of popular topic clusters of concern to bloggers and readers on the Internet, (2) the prediction of users’ preference scores by item-based collaborative filtering, and (3) the attention degree (click times) of blog articles obtained from Internet users. The detailed descriptions of the recommendation process are presented in Section 6.

In general, the effectiveness of the CF recommendation approach mostly depends on the set of historical data. There are still potential limitations, such as sparsity and cold start issues [2,39]. Low-quality recommendation results may be delivered due to the sparsity issue, namely when the system has only very few user rating records with which to measure the similarity between users or items. For the cold start issue of new items or new users, the system will perform poorly in recommendation because of the lack of records viewed by users.

In our research, we focus on mobile users and blog articles. We apply clustering techniques to first group the articles into topic clusters and then form neighborhoods of items from the topic clusters, which can reduce the sparsity problem and improve the scalability of recommender systems. Additionally, many blog articles have not been viewed by any mobile user in our system due to the limitations of mobile devices. This means that most articles that are popular on the Internet and attractive to the masses of Internet users may be ignored in the process of recommendation. Thus, our proposed recommendation approach not only considers mobile users’ preferences concerning the articles which have been pushed to them on their mobile devices, but also considers the perspectives of Internet readers to identify the popularity of articles, in order to improve the quality of recommendation.

4. Time-sensitive popularity tracking

In this section, we present a novel approach to predict the trend of time-sensitive popularity of blog topics. We identify the blog topic clusters and their popularity according to the perspectives of writers and readers on the Internet, and then trace the trend of popularity over time. In the following subsections, we illustrate the details of the tracking process shown in Fig. 2.

4.1. Forming topic clusters of blog articles

Articles in blogs are free-form and usually contain different opinions, so it is difficult to categorize articles into the appropriate categories defined by bloggers.
That is to say, the existing categories in a blog website are insufficient to fully represent the blog. In our research, we use article features, i.e., term-weight vectors, derived from the pre-processing to deal with blog articles which are published within a given time window on the Internet. We collect blog articles from blog websites as the training corpus to construct the dictionary by applying a statistical method, the log likelihood ratio, to extract meaningful phrases and terms. In addition, blog articles are trawled every day from blog websites according to the crows-RSS feeds. Note that the blog training data is periodically updated and the dictionary is retrained accordingly. Significant terms/phrases are extracted from the content of an article according to the dictionary derived from the blog training data. In addition, each article is represented as a term vector by using the tf-idf approach [33] to calculate the weight of term $i$ in an article $j$, as defined in Eq. (5):

$$w_{i,j} = f_{i,j} \times \log\frac{N}{n_i}, \qquad f_{i,j} = \frac{freq_{i,j}}{\max_l(freq_{l,j})}, \tag{5}$$

where $N$ is the number of articles; $n_i$ is the number of articles that contain term $i$; $f_{i,j}$ is the normalized frequency of term $i$ in article $j$; $freq_{i,j}$ is the frequency of term $i$ in article $j$; and $\max_l(freq_{l,j})$ is the frequency of the term $l$ which has the maximum frequency in article $j$.

The size of the time window is set as seven days. That is, all the articles posted in the past seven days will be categorized and recommended to individual users.

Fig. 2. Time-sensitive popularity tracking process.

A hierarchical agglomerative algorithm with the group-average clustering approach [16] is applied to implement the clustering step. It treats each article as a cluster first and then successively merges the pair of clusters with the highest cluster similarity. The similarity between two articles can be calculated by means of the cosine similarity measure, as shown in Eq. (6):

$$sim(d_i, d_j) = \cos(\vec{d_i}, \vec{d_j}) = \frac{\vec{d_i} \cdot \vec{d_j}}{\|\vec{d_i}\|\,\|\vec{d_j}\|}. \tag{6}$$

The cluster similarity between two clusters is defined as the average pairwise similarity of all pairs of articles from different clusters.
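The term weighting of Eq. (5) and the cosine measure of Eq. (6) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: articles are taken to be already tokenized into dictionary terms, the natural logarithm is used because the base of the logarithm is not specified in the text, and the names `tfidf_vectors` and `cosine_similarity` are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(articles):
    """Term-weight vectors following Eq. (5).

    articles : list of token lists, one list per article
    Returns one dict (term -> weight) per article.
    """
    n_articles = len(articles)
    # n_i: number of articles that contain term i.
    doc_freq = Counter()
    for tokens in articles:
        doc_freq.update(set(tokens))
    vectors = []
    for tokens in articles:
        counts = Counter(tokens)
        max_freq = max(counts.values()) if counts else 1
        vectors.append({
            # f_ij = freq_ij / max_l(freq_lj), w_ij = f_ij * log(N / n_i)
            term: (freq / max_freq) * math.log(n_articles / doc_freq[term])
            for term, freq in counts.items()
        })
    return vectors


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse term vectors, Eq. (6)."""
    dot = sum(w * vec_b.get(term, 0.0) for term, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_a * norm_b)
```

Note that a term occurring in every article receives weight zero under Eq. (5), so it contributes nothing to the similarity between articles.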
The cluster similarity between two clusters $r_i$ and $r_j$ is calculated by Eq. (7), where $d_i$ ($d_j$) is a blog article belonging to the set of blog articles $S_{r_i}$ ($S_{r_j}$) in Cluster $r_i$ ($r_j$); $|S_{r_i}|$ ($|S_{r_j}|$) is the number of blog articles in $S_{r_i}$ ($S_{r_j}$); and $\mathrm{sim}(d_i, d_j)$ denotes the cosine similarity between the articles $d_i$ and $d_j$:

$$\mathrm{similarity}(r_i, r_j) = \frac{\sum_{d_i \in S_{r_i}} \sum_{d_j \in S_{r_j}} \mathrm{sim}(d_i, d_j)}{|S_{r_i}| \, |S_{r_j}|}. \tag{7}$$

We stop merging pairs of clusters when the highest cluster similarity falls below a threshold during the merge process. The number of clusters each day is not constant; it depends on the density of the topics discussed. If the density of the topics that people discuss is high, the diversity of the articles is low and the number of clusters decreases.

4.2. Constructing the trend path between clusters belonging to adjacent days

To reveal the trend path that is used to predict the popularity degree of current clusters, we measure the cluster similarity between the target Cluster $r$ and every Cluster $pr$ belonging to the preceding period, and then link Cluster $r$ to the preceding cluster with the maximum similarity. As blog articles are usually composed of unstructured words, we obtain the similarity between two clusters belonging to two different days by averaging the cosine similarity between articles across the two clusters. The similarity between two clusters $(r, pr)$ in adjacent days is calculated by Eq. (7).

After establishing the linkages, the trend of each current cluster can be derived from its preceding related cluster. As shown in Fig. 3, every cluster receives a trend path from a preceding cluster. The topic of Cluster1 in day 3 evolves from Cluster1 in day 2, and so on, and we can use the relationship and similarity between them to calculate the popularity degree.

Fig. 3. The trend path of topic clusters (topic clusters of day 1 to day 4, linked across adjacent days).
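The group-average clustering step and the linking of a current cluster to its most similar preceding cluster can be sketched as follows. This is a naive illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, `sim` is any article-level similarity function (such as the cosine measure of Eq. (6)), the threshold value is left to the caller, and the pairwise search is recomputed at every merge for clarity rather than speed.

```python
from itertools import combinations


def cluster_similarity(cluster_a, cluster_b, sim):
    """Eq. (7): average pairwise similarity of articles from two clusters."""
    total = sum(sim(a, b) for a in cluster_a for b in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def agglomerate(articles, sim, threshold):
    """Merge the most similar pair of clusters until the best similarity
    falls below the threshold (group-average agglomerative clustering)."""
    clusters = [[article] for article in articles]  # each article starts alone
    while len(clusters) > 1:
        (i, j), best = max(
            (((i, j), cluster_similarity(clusters[i], clusters[j], sim))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda pair: pair[1],
        )
        if best < threshold:
            break  # no remaining pair is similar enough to merge
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i stays valid
    return clusters


def link_to_preceding(current, preceding_clusters, sim):
    """Section 4.2: pick the preceding cluster most similar to `current`.

    Returns (index of that preceding cluster, its similarity to `current`).
    """
    scores = [cluster_similarity(current, p, sim) for p in preceding_clusters]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]
```

Because the number of clusters is determined by the threshold rather than fixed in advance, a day with densely discussed topics naturally yields fewer clusters, as described above.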
4.3. Acquisition of actual popularity degree for each preceding cluster

After clustering blog articles to form topic clusters (e.g., theme groups) and constructing the trend path, we mainly use reader attention, namely the click times of topic clusters, to derive the popularity degree of each cluster. To help predict the popularity degree of a current cluster, we assume that click times are proportional to the reader attention that causes a topic to rise and flourish. The actual popularity degree of each preceding cluster is therefore acquired from the number of times its articles were clicked during the preceding period. Let $S_{pr}$ denote the set of blog articles in Cluster $pr$. For each preceding Cluster $pr$, we obtain $CT_t(S_{pr})$, the total click times of the articles in $S_{pr}$ on the Internet within the preceding time period $t$, as defined in Eq. (8):

$$CT_t(S_{pr}) = \sum_{d_i \in S_{pr}} \mathrm{ClickTimes}_t(d_i), \tag{8}$$

where $\mathrm{ClickTimes}_t(d_i)$ is the actual click times of blog article $d_i$ in time $t$. The click times are then converted to the actual popularity degree, $APD_{pr}(t)$, which is a value normalized by the maximum click times over all cluster sets $S_k$ in the preceding period $t$, as defined in Eq. (9):

$$APD_{pr}(t) = \frac{CT_t(S_{pr})}{\max_k\{\mathrm{ClickTimes}_t(S_k)\}} \times 100\%, \tag{9}$$

where $\mathrm{ClickTimes}_t(S_k)$ is the total click times of the articles in $S_k$, that is, $CT_t(S_k)$.

4.4. Predicting popularity degree of current cluster

We analyze the trend evolution of attention from Internet readers to predict the popularity degree of the current cluster. The time series of the popularity trend is a set of serial observation values in time order, as shown in Fig. 4. We modified the double exponential smoothing method described in Section 2.3 to forecast the degree of popular trend for each cluster of blog topics. We only give brief explanations of some equations of the double exponential smoothing method. Readers can refer to the references [6,7] for further details.
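Before turning to the prediction itself, the normalization of Section 4.3 (Eqs. (8) and (9)) can be sketched in a few lines. The function name `actual_popularity` and the input layout (a mapping from a cluster identifier to the click counts of its articles) are assumptions made for illustration only.

```python
def actual_popularity(click_times_by_cluster):
    """Return {cluster_id: APD in percent} for one preceding period t.

    click_times_by_cluster: {cluster_id: [click count of each article]}
    (hypothetical input layout).
    """
    # Eq. (8): total click times of the articles in each cluster
    totals = {cid: sum(clicks) for cid, clicks in click_times_by_cluster.items()}
    peak = max(totals.values(), default=0)
    if peak == 0:
        return {cid: 0.0 for cid in totals}  # no clicks recorded in this period
    # Eq. (9): normalize by the most-clicked cluster of the period
    return {cid: 100.0 * total / peak for cid, total in totals.items()}
```

By construction, the most-clicked cluster of a period receives an APD of 100% and every other cluster is scaled relative to it.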
For each Cluster $r$, we use a weighted average that combines the actual popularity degree (APD) and the predicted popularity degree (PPD) of the preceding period to predict the popularity degree of the current cluster, on the assumption that the effect of the popularity degree decays as days pass, as defined in Eq. (10):

$$PPD'_r(t+1) = \alpha \times APD_{pr}(t) + (1-\alpha) \times [PPD_{pr}(t) + b_{pr}(t)], \tag{10}$$

where we use Cluster $pr$ at the preceding time $t$ to predict the initial popularity degree of Cluster $r$ at time $t+1$, which is denoted by $PPD'_r(t+1)$. For the preceding Cluster $pr$ at time $t$, $APD_{pr}(t)$ is the actual popularity degree as mentioned above; $PPD_{pr}(t)$ denotes the predictive popularity degree of Cluster $pr$ at time $t$; and $b_{pr}(t)$ represents the trend effect for the previous period. Note that the value of the initial predictive popularity degree for the current cluster, $PPD'_r(t+1)$, is between zero and one. The parameter $\alpha$ is a smoothing constant between zero and one, which determines the relative importance of the actual popularity degree and the predictive popularity degree with trend effect in the preceding period.

We combine the difference of the predictive popularity degrees at time $t$ and at time $t-1$ with the trend effect at time $t-1$ to calculate the trend effect at time $t$, $b_{pr}(t)$, using a weighted average, as defined in Eq. (11):

$$b_{pr}(t) = \delta \times [PPD_{pr}(t) - PPD_{ppr}(t-1)] + (1-\delta) \times b_{ppr}(t-1). \tag{11}$$

Note that Cluster $pr$ is the preceding cluster of $r$, while Cluster $ppr$ is the preceding cluster of $pr$. $PPD_{ppr}(t-1)$ and $b_{ppr}(t-1)$ are the predictive popularity degree and the trend effect of Cluster $ppr$ at time $t-1$, respectively. The parameter $\delta$ is a smoothing constant between zero and one, which adjusts the relative importance of the difference between the predictive popularity degrees at time $t$ and at time $t-1$, and the trend effect at time $t-1$. The values of $\alpha$ and $\delta$ in Eqs. (10) and (11), respectively, can be decided by experts or by experimental analysis.
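Eqs. (10) and (11) translate directly into two small functions. The sketch below is illustrative only: the function and parameter names are hypothetical, and popularity values are assumed to be passed in a single consistent unit (percent points here).

```python
def trend_effect(ppd_pr, ppd_ppr, b_ppr, delta):
    """Eq. (11): trend effect b_pr(t) of the preceding cluster pr.

    ppd_pr:  PPD of cluster pr at time t
    ppd_ppr: PPD of cluster ppr (the predecessor of pr) at time t-1
    b_ppr:   trend effect of cluster ppr at time t-1
    """
    return delta * (ppd_pr - ppd_ppr) + (1 - delta) * b_ppr


def initial_prediction(apd_pr, ppd_pr, b_pr, alpha):
    """Eq. (10): initial predictive popularity degree PPD'_r(t+1)."""
    return alpha * apd_pr + (1 - alpha) * (ppd_pr + b_pr)


# Illustrative values: alpha = delta = 0.3, APD_pr = 10%, PPD_pr = 9.2%,
# and no prior prediction or trend for ppr.
b = trend_effect(ppd_pr=9.2, ppd_ppr=0.0, b_ppr=0.0, delta=0.3)
ppd_initial = initial_prediction(apd_pr=10.0, ppd_pr=9.2, b_pr=b, alpha=0.3)
```

With these inputs the trend effect is 2.76% and the initial prediction is about 11.37%. A larger $\alpha$ shifts weight toward the observed click-based popularity, while a larger $\delta$ makes the trend term react faster to the latest change in the predictions.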
Fig. 4. The time series of popularity trend (trend discovery over the preceding period and prediction for the current period along the time horizon).

The double exponential smoothing approach [7] is usually applied to analyze time series data; however, it does not consider the relation between topic clusters belonging to adjacent time periods. In our research, we concentrate on topic clusters in different time periods and construct the topic linkage from the preceding time to the current time as a topic trend path with a popularity degree. Therefore, to link topic clusters, the maximal similarity between adjacent clusters, i.e., current Cluster $r$ and
preceding Cluster $pr$, as described in Section 4.2, is selected to adjust the predictive popularity degree of Cluster $r$, as shown in Eq. (12). Notably, a smaller similarity implies a lower reliability of the prediction path between the two clusters:

$$PPD_r(t+1) = PPD'_r(t+1) \times \mathrm{similarity}(r, pr). \tag{12}$$

In Fig. 5, we take one trend path spanning three daily time periods as an example and set both parameters, $\alpha$ and $\delta$, to 0.3. We use the popularity of Cluster11, which belongs to Time $t$, to predict the popularity degree of Cluster22 in Time $t+1$. In the same way, Cluster01, which precedes Cluster11 on the path, is useful for inferring Cluster22. In the initial stage, the actual popularity degree of Cluster01 is assumed to be 40%. It is reasonable to assume $PPD'_{pr}(t) = APD_{ppr}(t-1)$, $PPD_{ppr}(t-1) = 0$, and $b_{ppr}(t-1) = 0$ at the starting time 0. That is, the predictive popularity degree $PPD_{\mathrm{Cluster01}}(t-1)$ and the trend effect $b_{\mathrm{Cluster01}}(t-1)$ of Cluster01 are both assumed to be zero. Thus, the initial predictive popularity degree of Cluster11 is 40%. The similarity across adjacent clusters is then taken into account to calculate the predictive popularity degree. Suppose that the similarity between Cluster01 and Cluster11 is 0.23; the predictive popularity degree of Cluster11 after adjustment is $40\% \times 0.23 = 9.2\%$. Next, we use the values derived previously to predict the initial popularity degree of Cluster22 according to Eq. (10):

$$PPD'_{\mathrm{Cluster22}}(t+1) = 0.3 \times APD_{\mathrm{Cluster11}}(t) + 0.7 \times [PPD_{\mathrm{Cluster11}}(t) + b_{\mathrm{Cluster11}}(t)].$$

The value of the trend effect, $b_{\mathrm{Cluster11}}(t)$, is derived using Eq. (11):

$$b_{\mathrm{Cluster11}}(t) = 0.3 \times [PPD_{\mathrm{Cluster11}}(t) - PPD_{\mathrm{Cluster01}}(t-1)] + 0.7 \times b_{\mathrm{Cluster01}}(t-1) = 0.3 \times 9.2\% = 2.76\%.$$

Thus, with an actual popularity degree of 10% for Cluster11 (Fig. 5), $PPD'_{\mathrm{Cluster22}}(t+1) = 0.3 \times 10\% + 0.7 \times [9.2\% + 2.76\%] = 11.37\%$. The value of similarity between Cluster11 and Cluster22 is 0.82.
We obtain the final predictive popularity degree as follows:

$$PPD_{\mathrm{Cluster22}}(t+1) = PPD'_{\mathrm{Cluster22}}(t+1) \times \mathrm{similarity}(\mathrm{Cluster11}, \mathrm{Cluster22}) = 11.37\% \times 0.82 = 9.32\%.$$

Fig. 5. The time series of topic clusters. [Figure: Cluster01 and Cluster02 at Time t-1; Cluster11, Cluster12 and Cluster13 at Time t; Cluster21 and Cluster22 at Time t+1. The example path links Cluster01 (APD = 40%) to Cluster11 (APD = 10%) with similarity 0.23, and Cluster11 to Cluster22 with similarity 0.82.]

5. Personal favorite analysis

In this section, we present a novel scheme that models the interests of users who browse blog articles on mobile devices. Our proposed methods are implemented to enhance an existing system running in a real mobile business environment. Because of the limited features of mobile devices, it is inconvenient for mobile users to give explicit relevance ratings of blog articles. Thus, the existing system does not provide a function for explicitly rating articles. Asking for explicit feedback such as item ratings places an extra burden on users; because it disturbs the normal browsing process, it is usually ignored by users [26]. Accordingly, we analyze the browsing patterns of mobile users as implicit feedback information to derive their preferences for blog articles.

5.1. Analysis of user browsing behavior

We model browsing patterns within session time by analyzing the log data of mobile users. A user's browsing pattern is derived by calculating his/her average reading time per word when browsing blog articles within a session. The system records the browsing time of blog articles requested by mobile users to derive the session interval and the browsing time for each article. A timeout mechanism terminates a session automatically when a user does not make any request within a given time period. The time interval between a user's requests for articles within each session provides an estimate of the user's browsing (stick) time on an article. In order to acquire the browsing pattern of user $u$, we analyze the browsing speed, $H_{u,s}$, which is the average browsing time per word in session $s$, as shown in Eq. (13):
$$H_{u,s} = \frac{1}{|D_{u,s}|} \sum_{d_i \in D_{u,s}} \frac{\mathrm{Time}_u(d_i)}{\mathrm{DocSize}(d_i)}, \tag{13}$$

where $d_i$ is an article $i$ that user $u$ browsed within session $s$; $D_{u,s}$ is the set of articles browsed by user $u$ in session $s$; $|D_{u,s}|$ denotes the number of articles in $D_{u,s}$; $\mathrm{DocSize}(d_i)$ is the number of words in the article; and $\mathrm{Time}_u(d_i)$ denotes user $u$'s browsing time on blog article $d_i$.

After obtaining a user's current browsing behavior, $H_{u,s}$, which is viewed as the user's recent pattern within one session, we predict the user's future browsing pattern with an incremental weighted approach, which modifies the former browsing pattern using the user's current browsing behavior. The parameter $\beta$ can be adjusted to make one of the two more important than the other. We believe that recent browsing behavior has a greater effect on the future behavior of the mobile user, so we set the parameter $\beta$ to give recent patterns more weight. The predicted browsing pattern is calculated by using Eq. (14), where $H'_{u,s}$ denotes the former browsing pattern accumulated up to session $s$ for mobile user $u$. We can then use the new browsing pattern at session $s$, i.e., $H_{u,s}$, to predict the future behavior at the new session $s+1$:

$$H'_{u,s+1} = \beta \times H_{u,s} + (1-\beta) \times H'_{u,s}. \tag{14}$$

5.2. Inferring user preference for articles

In this step, we infer user preferences for articles based on their browsing behavior, which is considered as implicit feedback information. Previous studies [27] have also found that reading time is indicative of interest. By analyzing a user's browsing time on an article, we can infer how interested the user is in the article and derive the corresponding preference score. If the browsing time is longer than usual, we can estimate that the user has a high preference level for the article.
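The browsing-pattern computation of Section 5.1 (Eqs. (13) and (14)) can be sketched as follows. The function names, the session representation as (browsing time, word count) pairs, and the use of seconds as the time unit are all assumptions made for illustration; the text does not specify them.

```python
def browsing_speed(session):
    """Eq. (13): average browsing time per word within one session.

    session: list of (browsing_time, doc_size_in_words) pairs, one per
    article the user browsed in the session (assumed representation).
    """
    if not session:
        raise ValueError("a session needs at least one browsed article")
    return sum(time / size for time, size in session) / len(session)


def update_pattern(former_pattern, current_speed, beta):
    """Eq. (14): blend the current session's speed into the former pattern.

    A beta above 0.5 gives the recent session more weight than history.
    """
    return beta * current_speed + (1 - beta) * former_pattern


# Two articles: 60 s on 300 words and 30 s on 100 words (illustrative values)
speed = browsing_speed([(60, 300), (30, 100)])
pattern = update_pattern(former_pattern=0.4, current_speed=speed, beta=0.7)
```

Here the session speed is 0.25 time units per word, and blending it with a former pattern of 0.4 at $\beta = 0.7$ gives roughly 0.295. Since only one number per user has to be stored between sessions, the update is cheap to run incrementally.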
Based on the user's usual browsing behavior, we employ the user's browsing pattern described in Section 5.1 to estimate the browsing time for the article, and we compare this Predicted Browsing Time, $PBT_u(d_i)$, with the user's Actual Browsing Time, $ABT_u(d_i)$. The predicted browsing time $PBT_u(d_i)$ is equal to $\mathrm{DocSize}(d_i) \times H'_{u,s+1}$, where $\mathrm{DocSize}(d_i)$ is the size (number of words) of blog article $d_i$ and $H'_{u,s+1}$ denotes the average browsing time per word for user $u$, as described in Section 5.1. We then calculate the preference score (PS) of target user $u$ for blog article $d_i$ as follows:

$$PS_u(d_i) = \frac{1}{1 + \dfrac{PBT_u(d_i)}{ABT_u(d_i)}}. \tag{15}$$

The value of this function lies in the range (0, 1); a higher preference score means that the user has more interest in the article. For instance, when the actual browsing time equals the predicted browsing time, the score is 0.5.

6. Hybrid recommendation

In this section, we propose a novel hybrid method that combines user preference prediction by collaborative filtering, the Internet attention degrees of articles, and the customized popularity degree of topic clusters, in order to recommend personalized blog articles to mobile users.

The basic idea of this process is to integrate the different viewpoints of mobile users and Internet users. We use an item-based collaborative filtering approach to recommend latent articles of interest according to the actual browsing behavior of mobile users. However, the CF approach suffers from the sparsity and cold-start issues. Because of the limitations of mobile devices, mobile users cannot easily surf blog articles, and many articles are never browsed by mobile users. This means that the most popular articles on the Internet, which attract the masses of Internet users, may be ignored in the process of recommendation.
Thus, our proposed recommendation approach considers not only the mobile users' preferences concerning the articles that have been pushed to their mobile devices, but also the viewpoints of Internet readers, which identify the attention degree of articles, in order to improve the quality of recommendation. We also consider the predictive popularity degree of the topic cluster to which each article belongs. The more popular the topic of an article is, the more users will be interested in the article.

6.1. Topic-based collaborative filtering

Research has demonstrated that the item-based CF approach can efficiently produce high-quality recommendations. The item-based CF method usually computes item similarity based on the whole set of items. However, user preferences for items of different clusters may vary, since the items of different clusters have different characteristics. Mobile users with similar preferences for one topic cluster (e.g., movies) may have different interests in other topics. As mentioned previously, we first apply clustering techniques to group the articles into topic clusters and then form neighborhoods of items within the topic clusters; this can improve the scalability of recommender systems. For each topic cluster, we adopt the item-based CF method to predict mobile users' preferred articles, because efficiency is a concern for commercial systems.

We use the adjusted cosine [34] to measure the similarity between two articles, $d_i$ and $d_j$, which belong to Cluster $r$, as defined in Eq. (16). The set of users who co-rate both $d_i$ and $d_j$ is denoted by $U_{ij}$. $PS_u(d_i)$ is the preference score of user $u$ on article $d_i$; $\overline{PS}_u$ is the average preference score of mobile user $u$: