Categorizing User Interests in Recommender Systems Sourav Saha, Sandipan Majumder, Sanjog Ray, and Ambuj Mahanti I IM Calcutta souravso8eiimcal.ac.in,sandipan.compenggagmail.com IIM Indore sanjogr@iimidr ac in Abstract. The traditional method of recommender systems suffers from the Sparsity problem whereby incomplete dataset results in poor recommendations Another issue is the drifting preference, i.e. the change of the user's prefe with time. In this paper, we propose an algorithm that takes minimal inputs to do away with the Sparsity problem and takes the drift into consideration giving more priority to latest data. The streams of elements are decomposed into the orresponding attributes and are classified in a preferential list with tags as Sporadic”,"New”;" Regular”,"Old”and"Past"- each category signifying a hanging preference over the previous respectively. A repeated occurrence of attribute set of interest implies the users preference for such attribute(s) The proposed algorithm is based on drifting preference and has been tested with the Yahoo Webscope R4 dataset. Results have shown that our algorithm have shown significant improvements over the comparable"Sliding Window algorithm Keywords: Drifting Preference, Recommender Systems, Collaborative, Yahoo Movies 1 Introduction Recommender systems are technology based systems that provide personalized recommendations to users. They generate recommendations by profiling each user Profiling is done by observing each users interests, online behavior and transaction history. Recommendations can also take into account opinions and actions of other users with similar tastes. Recommender systems algorithms can be classified into two major categories, namely content based and collaborative filtering based. A third approach called hybrid approach combines both content based and collaborative fil- tering based methods. In content based recommendations, a user is recommended items similar to the items he preferred in the past For content based recommenda- tions, each item in the items domain must be characterized by a set of keywords that describe it. In content-based recommendations, a profile of a user is first created by keyword analysis of the items previously seen and rated by him. On the basis of the This research has been supported by Indian Institute of Management Calcutta(IIM Calcutta), work order no. 019/RP: SIRS//397/2008-09 R. Setchi et al(Eds ) KES 2010, Part Il, LNAI 6277, Pp. 282-291, 2010 e Springer-Verlag Berlin Heidelberg 2010
R. Setchi et al. (Eds.): KES 2010, Part II, LNAI 6277, pp. 282–291, 2010. © Springer-Verlag Berlin Heidelberg 2010 Categorizing User Interests in Recommender Systems* Sourav Saha1 , Sandipan Majumder1 , Sanjog Ray2 , and Ambuj Mahanti1 1 IM Calcutta souravs08@iimcal.ac.in, sandipan.compengg@gmail.com, am@iimcal.ac.in 2 IIM Indore sanjogr@iimidr.ac.in Abstract. The traditional method of recommender systems suffers from the Sparsity problem whereby incomplete dataset results in poor recommendations. Another issue is the drifting preference, i.e. the change of the user’s preference with time. In this paper, we propose an algorithm that takes minimal inputs to do away with the Sparsity problem and takes the drift into consideration giving more priority to latest data. The streams of elements are decomposed into the corresponding attributes and are classified in a preferential list with tags as “Sporadic”, “New”, “Regular”, “Old” and “Past” – each category signifying a changing preference over the previous respectively. A repeated occurrence of attribute set of interest implies the user’s preference for such attribute(s). The proposed algorithm is based on drifting preference and has been tested with the Yahoo Webscope R4 dataset. Results have shown that our algorithm have shown significant improvements over the comparable “Sliding Window” algorithm. Keywords: Drifting Preference, Recommender Systems, Collaborative, Yahoo Movies. 1 Introduction Recommender systems are technology based systems that provide personalized recommendations to users. They generate recommendations by profiling each user. Profiling is done by observing each user’s interests, online behavior and transaction history. Recommendations can also take into account opinions and actions of other users with similar tastes. Recommender systems algorithms can be classified into two major categories, namely content based and collaborative filtering based. A third approach called hybrid approach combines both content based and collaborative filtering based methods. In content based recommendations, a user is recommended items similar to the items he preferred in the past. For content based recommendations, each item in the items domain must be characterized by a set of keywords that describe it. In content-based recommendations, a profile of a user is first created by keyword analysis of the items previously seen and rated by him. On the basis of the * This research has been supported by Indian Institute of Management Calcutta (IIM Calcutta), work order no. 019/RP: SIRS//3397/2008-09
Categorizing User Interests in Recommender Systems 283 profile created, the user is then recommended new items. Content based systems are not very popular as they lag in recommendation quality when the users tastes change Also, in many domains getting high quality content data is a major issue. In collabora tive filtering, the user is recommended items that people with similar tastes and prefe rences liked in the past. This technique mainly relies on explicit ratings given by the user and is the most successful and widely used technique. There are two primary approaches that are used to build collaborative filtering memory based recommender systems: user based collaborative filtering [1] and item based collaborative filtering [3]. In user based collaborative filtering, a neighborhood of k similar users is found for a particular user. Items are then recommended to the user from the set of items seen by its k nearest neighbors but not yet seen by him. In item based collaborative filtering, similarities between the various items are computed. Items are then recom mended to the user from the set of k items that have highest similarity to the items already present in the user's profile One drawback of traditional recommendation systems are that they assume user in- terests to be static, which is not the case as preferences change with time. This change in preference can occur in two ways. The first kind of change that is most widely seen occurs when a user develops a new interest. Some common examples of this kind change are a movie viewer acquiring a recent liking for western movies, a book reader developing a new interest for books by an author or on a particular subject Examples of users developing new interests are prevalent in most domains where recommender systems are extensively used like books, movies, music, research pa pers, television etc. Popular recommender system algorithms, like user based colla borative filtering and item based collaborative filtering algorithms, use only ratings data to generate recommendation. As context data like genre, subject and keywor are not used, the recommendations generated by these systems are not specifically customized for the user's c interests, and as a result there is lag before they can detect change in his preferences. This inability of systems to quickly recognize a shift in interest is also known as the concept drift problem. The second kind of change occurs when user preference for a particular item hanges over time. When asked to re-rate movies, users may re-rate the same item differently. User preferences evolve over time: a user who has rated an item highly five years back may not hold the same opinion about the item currently. Existing algorithms analyze a user's complete movie rating data to create a profile. As equal weights are given to all the items in the user's profile, the profile created may not give the true picture of the user's current preferences. These drawbacks of existing algo- rithms can hamper the effectiveness of the recommender systems, which can lead to missed opportunities to market products to a user and can affect user confidence to- wards the recommender system. Most of the previous research on the problem of concept drift has focused on plying weights to time series data. The weights applied are based on a time decaying function with higher weights applied to new items. Older data points are removed or very low weights are applied to them so as to decrease their influence on recommen- dations generated. This approach can be helpful in solving the problem of concept drift when the second kind of interest change occurs, i.e. when the user's preference for a particular item changes over time and the present data points present a truer picture of the user's interest. But this approach of using only current data points for
Categorizing User Interests in Recommender Systems 283 profile created, the user is then recommended new items. Content based systems are not very popular as they lag in recommendation quality when the user’s tastes change. Also, in many domains getting high quality content data is a major issue. In collaborative filtering, the user is recommended items that people with similar tastes and preferences liked in the past. This technique mainly relies on explicit ratings given by the user and is the most successful and widely used technique. There are two primary approaches that are used to build collaborative filtering memory based recommender systems: user based collaborative filtering [1] and item based collaborative filtering [3]. In user based collaborative filtering, a neighborhood of k similar users is found for a particular user. Items are then recommended to the user from the set of items seen by its k nearest neighbors but not yet seen by him. In item based collaborative filtering, similarities between the various items are computed. Items are then recommended to the user from the set of k items that have highest similarity to the items already present in the user’s profile. One drawback of traditional recommendation systems are that they assume user interests to be static, which is not the case as preferences change with time. This change in preference can occur in two ways. The first kind of change that is most widely seen occurs when a user develops a new interest. Some common examples of this kind change are a movie viewer acquiring a recent liking for western movies, a book reader developing a new interest for books by an author or on a particular subject. Examples of users developing new interests are prevalent in most domains where recommender systems are extensively used like books, movies, music, research papers, television etc. Popular recommender system algorithms, like user based collaborative filtering and item based collaborative filtering algorithms, use only ratings data to generate recommendation. As context data like genre, subject and keywords are not used, the recommendations generated by these systems are not specifically customized for the user’s current interests, and as a result there is lag before they can detect change in his preferences. This inability of systems to quickly recognize a shift in interest is also known as the concept drift problem. The second kind of change occurs when user preference for a particular item changes over time. When asked to re-rate movies, users may re-rate the same item differently. User preferences evolve over time: a user who has rated an item highly five years back may not hold the same opinion about the item currently. Existing algorithms analyze a user’s complete movie rating data to create a profile. As equal weights are given to all the items in the user’s profile, the profile created may not give the true picture of the user’s current preferences. These drawbacks of existing algorithms can hamper the effectiveness of the recommender systems, which can lead to missed opportunities to market products to a user and can affect user confidence towards the recommender system. Most of the previous research on the problem of concept drift has focused on applying weights to time series data. The weights applied are based on a time decaying function with higher weights applied to new items. Older data points are removed or very low weights are applied to them so as to decrease their influence on recommendations generated. This approach can be helpful in solving the problem of concept drift when the second kind of interest change occurs, i.e. when the user’s preference for a particular item changes over time and the present data points present a truer picture of the user’s interest. But this approach of using only current data points for
S. Saha et al generating recommendations further magnifies the problem of data sparsity. While previous research on concept drift in recommender systems have tried to tackle the problem by using weighted data, none of them have tried to use content data like genre, keywords etc. And without using content data, the more prevalent problem of oncept drift due to new interests cannot be solved. In this paper, we propose an approach to categorize user preferences from its trans action history in a recommender system. Our approach can be used by context based collaborative filtering based and hybrid recommender system algorithms to make better predictions. Our approach of categorizing user interests can be used as one component of a hybrid algorithm designed for solving the concept drift problem. In section 2 we describe our categorization approach for user interests. In section 3 we provide an example to further clarify our approach. In section 4 we review previous research done in the area of drifting preferences. Section 5 describes the experimental evaluation. Section 6 has the discussions on the results obtained. Finally, we conclude our paper in section 7. 2 Proposed Model for Categorizing User Interests A user's interests are identified by analyzing the attribute or content descriptor of items associated with his transactions. In movies domain, movie genres like action, drama, comedy, animation, etc, are used to characterize a movie. If a user only se lects movies belonging to drama and action genres then we can say that the user's interests are drama and action. Similarly, in books domain, subject keywords are used to identify interests. While in music domain genres like rock, pop, country music etc are used to identify interests. Our approach is relevant for all domains where items can be differentiated on the basis of content descriptors and can be grouped into cate- gories based on the content descriptors. In this paper, we have used movies domain to illustrate our approach. We use the terminology interests and genres interchangeably In movies domain, the genre of the movie is the most popular content descriptor data that is used to describe movie content. A single movie can have more than one genre associated with it. For example, the movie Casablanca is associated with drama romance and war genres. To elaborate our categorization model, we first describe the different classes we have defined to classify a user set of interests derived from his transactions, and then we explain our design for movement of interests from one class to another for a user 2.1 Interests Categorization In our proposed approach, a user's interests can be categorized into five classes Sporadic: Interests categorized as"Sporadic"category for a user are those attributes that occur once in a while in the user's transactions. Sporadic"interests are initially considered as a deviation from the users regular interests and seen as transient inter ests, as the probability of them occurring again in future transactions is minimal Recommender systems should ideally ignore these interests as ing the reoccur rence of these interests in future transactions is not feasible. A user buying a gif can be seen as an example of"Sporadic"interests. A user who does not like the
284 S. Saha et al. generating recommendations further magnifies the problem of data sparsity. While previous research on concept drift in recommender systems have tried to tackle the problem by using weighted data, none of them have tried to use content data like genre, keywords etc. And without using content data, the more prevalent problem of concept drift due to new interests cannot be solved. In this paper, we propose an approach to categorize user preferences from its transaction history in a recommender system. Our approach can be used by context based, collaborative filtering based and hybrid recommender system algorithms to make better predictions. Our approach of categorizing user interests can be used as one component of a hybrid algorithm designed for solving the concept drift problem. In section 2 we describe our categorization approach for user interests. In section 3 we provide an example to further clarify our approach. In section 4 we review previous research done in the area of drifting preferences. Section 5 describes the experimental evaluation. Section 6 has the discussions on the results obtained. Finally, we conclude our paper in section 7. 2 Proposed Model for Categorizing User Interests A user’s interests are identified by analyzing the attribute or content descriptor of items associated with his transactions. In movies domain, movie genres like action, drama, comedy, animation, etc., are used to characterize a movie. If a user only selects movies belonging to drama and action genres then we can say that the user’s interests are drama and action. Similarly, in books domain, subject keywords are used to identify interests. While in music domain genres like rock, pop, country music etc are used to identify interests. Our approach is relevant for all domains where items can be differentiated on the basis of content descriptors and can be grouped into categories based on the content descriptors. In this paper, we have used movies domain to illustrate our approach. We use the terminology interests and genres interchangeably. In movies domain, the genre of the movie is the most popular content descriptor data that is used to describe movie content. A single movie can have more than one genre associated with it. For example, the movie Casablanca is associated with drama, romance and war genres. To elaborate our categorization model, we first describe the different classes we have defined to classify a user set of interests derived from his transactions, and then we explain our design for movement of interests from one class to another for a user. 2.1 Interests Categorization In our proposed approach, a user’s interests can be categorized into five classes. Sporadic: Interests categorized as “Sporadic” category for a user are those attributes that occur once in a while in the user’s transactions. “Sporadic” interests are initially considered as a deviation from the user’s regular interests and seen as transient interests, as the probability of them occurring again in future transactions is minimal. Recommender systems should ideally ignore these interests as predicting the reoccurrence of these interests in future transactions is not feasible. A user buying a gift can be seen as an example of “Sporadic” interests. A user who does not like the
Categorizing User Interests in Recommender Systems 285 science-fiction genre may buy science-fiction books as gifts for friends or family. This change in user interest should be considered as a passing interest rather than a new interest. Recommender systems that consider sporadic interests as a change in a users taste will end up recommending items that may not be useful to the user. The chal lenge is to identify those sporadic interests that have the potential to become new interests. Interests that were once"Sporadic"but don' t recur in future has to leave the system altogether. If it starts recurring it has to start afresh as "Fresh Interest"(Fig-1) New: New" interests are those interests that are new to the user but have been occur- ring more often in the users recent transactions. For a user, few of his interests cate- radic"interests become categorized as "New"interests after they occur more often in successive transactions. Similarly, non-occurrences of an interest existing in"New"category in successive transactions will result in the interest shift ing back to the ""Sporadic"category of the user. Similarly, interests in the"New category that occur frequently in successive transactions get categorized as"Regular interest for the user. "New"interests are of much importance from a marketer's point of view: as the user is exploring a new area of interest, he is more likely to be influ enced by a recommendation made by the system. Regular: Once the user has subscribed to some particular attributes of interest a con- siderable number of times, such an interest can be put in the" Regular"category. This particular type of interest is assumed to hold good for the future transactions of the user as well and repeated occurrences only reinforce the belief. It needs to be mentioned that for every categories, inclusion of an interest takes place after the cor responding threshold is overcome(similarly, breaching the threshold results in exclu sion and shifting to a lower category) Old: Interests once categorized in the"Regular "Category of a user gets classified in the"Old"Category when they stop occurring in frequent successive transactions. An interest classified as"Old"interests can get reclassified as a"Regular"interest with increased occurrence in recent transactions(and thereby successfully crossing the threshold). "Old"and"New"have nearly similar preferences during prediction. Past: Successive non-occurrence of an interest in the"Old"category results it being shifted to the"Past"category. " Past"categories can have equal or more preferences compared to "Sporadic"category but they definitely have a lower preference com- pared to the“New" category. Interests in the“ Past"category can turn back to“Old” with repeated occurrences or may have to leave the system altogether with repeated non-occurrences. An interest having once left the system if starts recurring once again later, it has to follow the channel of fresh Interest as shown in Fig-l below. An inter est leaving the system from"Sporadic"or"Past"has to start afresh 2. 2 Movement of Interests The designing of the algorithm here takes care of the drifting effect. Repeated occur- rence of interest adds more weightage to that interest. Alternatively, non-occurrence decreases the weightage. The movement of the interest depends on the threshold de- fined for the particular category. With repeated occurrences the interest is moved to the next higher category if it's more than the threshold defined for the next category Similarly, if some interest stops occurring, then its weightage initially remains
Categorizing User Interests in Recommender Systems 285 science-fiction genre may buy science-fiction books as gifts for friends or family. This change in user interest should be considered as a passing interest rather than a new interest. Recommender systems that consider sporadic interests as a change in a user’s taste will end up recommending items that may not be useful to the user. The challenge is to identify those sporadic interests that have the potential to become new interests. Interests that were once “Sporadic” but don’t recur in future has to leave the system altogether. If it starts recurring it has to start afresh as “Fresh Interest” (Fig-1). New: “New” interests are those interests that are new to the user but have been occurring more often in the user’s recent transactions. For a user, few of his interests categorized as “Sporadic” interests become categorized as “New” interests after they occur more often in successive transactions. Similarly, non-occurrences of an interest existing in “New” category in successive transactions will result in the interest shifting back to the “Sporadic” category of the user. Similarly, interests in the “New” category that occur frequently in successive transactions get categorized as “Regular” interest for the user. “New” interests are of much importance from a marketer’s point of view: as the user is exploring a new area of interest, he is more likely to be influenced by a recommendation made by the system. Regular: Once the user has subscribed to some particular attributes of interest a considerable number of times, such an interest can be put in the “Regular” category. This particular type of interest is assumed to hold good for the future transactions of the user as well and repeated occurrences only reinforce the belief. It needs to be mentioned that for every categories, inclusion of an interest takes place after the corresponding threshold is overcome (similarly, breaching the threshold results in exclusion and shifting to a lower category). Old: Interests once categorized in the “Regular” Category of a user gets classified in the “Old” Category when they stop occurring in frequent successive transactions. An interest classified as “Old” interests can get reclassified as a “Regular” interest with increased occurrence in recent transactions (and thereby successfully crossing the threshold). “Old” and “New” have nearly similar preferences during prediction. Past: Successive non-occurrence of an interest in the “Old” category results it being shifted to the “Past” category. “Past” categories can have equal or more preferences compared to “Sporadic” category but they definitely have a lower preference compared to the “New” category. Interests in the “Past” category can turn back to “Old” with repeated occurrences or may have to leave the system altogether with repeated non-occurrences. An interest having once left the system if starts recurring once again later, it has to follow the channel of Fresh Interest as shown in Fig-1 below. An interest leaving the system from “Sporadic” or “Past” has to start afresh. 2.2 Movement of Interests The designing of the algorithm here takes care of the drifting effect. Repeated occurrence of interest adds more weightage to that interest. Alternatively, non-occurrence decreases the weightage. The movement of the interest depends on the threshold defined for the particular category. With repeated occurrences the interest is moved to the next higher category if it’s more than the threshold defined for the next category. Similarly, if some interest stops occurring, then its weightage initially remains
S. Saha et al constant for the next few transactions and then finally stops dropping. Both the decay percentage and the waiting period depend on the preference category to which the interest currently belongs. Thus if the users had an inclination for a given interest ometime back, which has stopped recurring then such preference initially waits for the next few transaction and then starts fading off and finally goes away threshold hreshold Fresh Interests Nev Lost interes old f Lets define the new algorithm now. We assume the data is in the chronological order, grouped by the user. Let there be U1, U2, Ui. UN user available. A particular user, denoted by Ui has moved through mi elements. Thus, the total number of records in the system is given by Xinmi=M= Xj=1Rj Define the thresholds setT=(I.E=(TS ITN ITR], To, TP))for Sporadic", "New","Regular, "Old"and"Past"category respectively, where each of the Ts contain the threshold for both inclusion and exclusion for the corresponding category. Define the weight function f(c) where c is the category in which the genre belongs currently. The h(c, w) is an activation function taking binary values(0 or 1) where w is the pre-defined window size for successive non-occurrence of any given attribute that currently belong to a particular category c(here Sporadic, New, Regular, Old and Past)after which the decaying function gets activated. Hence the decaying function d(c, w) is the product of f(c)h(c, w) Initially, when an attribute comes into as a fresh interest for a given user, it's given some weightage based on the frequency of occurrence. Once the cumulative weights cross the threshold for the starting category, which is"Sporadic", such attribute of interest is tagged as"". From" Sporadic"the attribute can move to"N category and then to the"Regular"category, each time satisfying the threshold for the given category (refer Fig-1). If the attribute stops recurring indicating a loss of inter est or oversaturation, its weight starts decreasing by a certain percentage after a given waiting period(pre-defined window size) based on the category to which it belong currently. The interest moves to the lower categories each time the threshold for a given category is breached and then finally moves out of the system. 3 An Example The following example illustrates our approach more clearly. Let us consider a who rents movies(the element under consideration) in last 10 transactions. The lowing table shows how the attribute of interest(here movie genres) shift across five sets for this user on the basis of his movie selection behavior
286 S. Saha et al. constant for the next few transactions and then finally stops dropping. Both the decay percentage and the waiting period depend on the preference category to which the interest currently belongs. Thus if the users had an inclination for a given interest sometime back, which has stopped recurring then such preference initially waits for the next few transaction and then starts fading off and finally goes away. Fig. 1. Let’s define the new algorithm now. We assume the data is in the chronological order, grouped by the user. Let there be U1, U2, Ui…UN user available. A particular user, denoted by Ui has moved through mi elements. Thus, the total number of records in the system is given by Define the thresholds set T = {{I},{E}} = {{TS}, {TN}, {TR}, {TO}, {TP}} for “Sporadic”, “New”, “Regular”, “Old” and “Past” category respectively, where each of the Ts contain the threshold for both inclusion and exclusion for the corresponding category. Define the weight function f(c) where c is the category in which the genre belongs currently. The h(c,w) is an activation function taking binary values (0 or 1) where w is the pre-defined window size for successive non-occurrence of any given attribute that currently belong to a particular category c (here Sporadic, New, Regular, Old and Past) after which the decaying function gets activated. Hence the decaying function d(c,w) is the product of f(c)*h(c,w). Initially, when an attribute comes into as a fresh interest for a given user, it’s given some weightage based on the frequency of occurrence. Once the cumulative weights cross the threshold for the starting category, which is “Sporadic”, such attribute of interest is tagged as “Sporadic”. From “Sporadic” the attribute can move to “New” category and then to the “Regular” category, each time satisfying the threshold for the given category (refer Fig-1). If the attribute stops recurring indicating a loss of interest or oversaturation, its weight starts decreasing by a certain percentage after a given waiting period (pre-defined window size) based on the category to which it belongs currently. The interest moves to the lower categories each time the threshold for a given category is breached and then finally moves out of the system. 3 An Example The following example illustrates our approach more clearly. Let us consider a user who rents movies (the element under consideration) in last 10 transactions. The following table shows how the attribute of interest (here movie genres) shift across the five sets for this user on the basis of his movie selection behavior. σ ݉݅ ே ୀଵ ൌ ܯ ൌ σ ܴ݆ ܯ ݆ൌͳ Fresh Interests ⎯threshold ⎯ → ⎯⎯ Sporadic ⎯threshold ⎯ → ⎯⎯ New ⎯threshold ⎯ → ⎯⎯ Regular Lost Interests ←⎯ ⎯ threshold ⎯⎯ Past ←⎯ ⎯ threshold ⎯⎯ Old ←⎯ ⎯ threshold ⎯⎯ Regular
Categorizing User Interests in Recommender Systems 287 Here f(c) 0.25for“ Regular =0.50for“old” and"New =0.75for“ sporadic”and"Past The activation function for decaying h(c, w) has the value 0 or 1 based on the follow ing pre-defined window size for successive non-occurrences of an already occurred attribute of interest for a given user. Here w =3for“oOld”and"New” h(Regular, 2)=0 while h("Sporadic, 2)=1. To exemplify, once a particular is into the"Regular"category, even if it doesnt occur for 3 successive times, it still remains in the said category. Only after the 4th consecutive non-occurrence, the weightage starts decaying by a value of 0. 25 which is the f(c) value for"Regular category. The attribute moves to the lower one on breaching threshold for"Regular The threshold set T= ITS, TN, TR, To, TP)=[0,0). 3, 3),5,5.( 2, 2). (-0.5 0.5)). In the current example, the inclusion threshold and the exclusion threshold are the same and the incremental weightage for occurrence of an attribute of interest is 1 transaction irrespective of interest category. All the above values have been estab lished experimentally for the given dataset. The values would depend based on the nature and content of the data Transaction New Thriller(1). Fiction/ ntasy(2), Adventure(3) /Family, Fantasy(1.25). Adventure(4) ction/ Adventure Sciene
Categorizing User Interests in Recommender Systems 287 Here f(c) = 0.25 for “Regular” = 0.50 for “Old” and “New” = 0.75 for “Sporadic” and “Past” The activation function for decaying h(c,w) has the value 0 or 1 based on the following pre-defined window size for successive non-occurrences of an already occurred attribute of interest for a given user. Here w = 4 for “Regular” = 3 for “Old” and “New” = 2 for “Sporadic” and “Past” Thus h(“Regular”,2) = 0 while h(“Sporadic”,2) = 1. To exemplify, once a particular genre is into the “Regular” category, even if it doesn’t occur for 3 successive times, it still remains in the said category. Only after the 4th consecutive non-occurrence, the weightage starts decaying by a value of 0.25 which is the f(c) value for “Regular” category. The attribute moves to the lower one on breaching threshold for “Regular”. The threshold set T = {TS, TN, TR, TO, TP} = {{0,0},{3,3),{5,5},{2,2},(-0.5,- 0.5}}. In the current example, the inclusion threshold and the exclusion threshold are the same and the incremental weightage for occurrence of an attribute of interest is 1 per transaction irrespective of interest category. All the above values have been established experimentally for the given dataset. The values would depend based on the nature and content of the data. Transaction Movie Genre Sporadic New Regular Old Past T1 Thriller, Science Fiction/ Fantasy, Action/ Adventure Thriller(1), Science Fiction/ Fantasy(1), Action/ Adventure(1) - - - - T2 Action/ Adventure, Science Fiction/ Fantasy, Thriller, Science Fiction/ Fantasy(2), Action/ Adventure(2) - - - - T3 Action/ Adventure Thriller(0.25), Science Fiction/ Fantasy(2) Action/ Adventure(3) - - - T4 Comedy, Romance, Kids/Family, Animation Science Fiction/ Fantasy(1.25), Comedy(1), Romance(1), Kids/Family(1), Animation(1) Action/ Adventure(3) - - - T5 Science Fiction/ Fantasy, Action/ Adventure Comedy(1), Romance(1), Kids/Family(1), Animation(1), Science Fiction/ Fantasy(2.25) Action/ Adventure(4) - - -
88 S Saha et al Action/ y0.25) venture, Romance(0. 25). Fantasy(. 25) Drama(1) Romance(. 25). Scienc Drama(D) Adventure(s 3.25 Drama(0.75) Adventure(s Adventure(s Romance(4.25) Action/ Fiction/ Adventure(4.75) 4 Previous work done Recommender systems are a widely researched area [ 3], prediction problem and secu- rity issues being the more popular issues worked upon. However, not much work has been done on the concept drift problem. Most of the previous works [4, 5, and 6]are mainly different variations of assigning weights to data. The basic approach used is to assign a greater weight to new data, and discount old data so as to decay or remove the effect of old data. In [4] decreasing weights are assigned to old data. In 5 the authors have proposed an item similarity based data weight in addition to time based data weight. They have added the new weights to handle the issue of reoccurrence of user interests. In [6] a new similarity measure is used and the authors propose using weights for items based on expected accuracy on the future preferences of the user. Most of the earlier works have formulated the problem as a prediction accuracy prob- lem. As a result, they have tried to measure the effectiveness of their approaches through accuracy metrics like mean absolute error or precision [7]. While these me- trics can capture the effectiveness of the algorithm to make rating prediction of un- seen items for a user, they cannot measure the capability of the algorithm to detect interest shift due to new interest. Unlike our approach, none of the existing approach es use genre data or content data to solve the problem of concept drift in recommend- er systems. And we believe that any approach that proposes to tackle the problem of detecting new interest has to use genre data or content data. 5 Experimental Evaluation We experimental evaluation our proposed approach on Yahoo Webscope R4 that has details of Yahoo Movie user ratings and Movie descriptive content. In total the train ing set had 174130 records, 6719 distinct users and an average movie viewership of 25.92 per user. The max no. of movies seen by a particular user was 1388 while the minimum was 1 To conduct the evaluation, we used the test-dataset once again provided by the Ya hoo Webscope R4. The test dataset had a total of 8358 records, 1675 distinct user
288 S. Saha et al. 4 Previous Work Done Recommender systems are a widely researched area [3], prediction problem and security issues being the more popular issues worked upon. However, not much work has been done on the concept drift problem. Most of the previous works [4, 5, and 6] are mainly different variations of assigning weights to data. The basic approach used is to assign a greater weight to new data, and discount old data so as to decay or remove the effect of old data. In [4] decreasing weights are assigned to old data. In [5] the authors have proposed an item similarity based data weight in addition to time based data weight. They have added the new weights to handle the issue of reoccurrence of user interests. In [6] a new similarity measure is used and the authors propose using weights for items based on expected accuracy on the future preferences of the user. Most of the earlier works have formulated the problem as a prediction accuracy problem. As a result, they have tried to measure the effectiveness of their approaches through accuracy metrics like mean absolute error or precision [7]. While these metrics can capture the effectiveness of the algorithm to make rating prediction of unseen items for a user, they cannot measure the capability of the algorithm to detect interest shift due to new interest. Unlike our approach, none of the existing approaches use genre data or content data to solve the problem of concept drift in recommender systems. And we believe that any approach that proposes to tackle the problem of detecting new interest has to use genre data or content data. 5 Experimental Evaluation We experimental evaluation our proposed approach on Yahoo Webscope R4 that has details of Yahoo Movie user ratings and Movie descriptive content. . In total the training set had 174130 records, 6719 distinct users and an average movie viewership of 25.92 per user. The max no. of movies seen by a particular user was 1388 while the minimum was 1. To conduct the evaluation, we used the test-dataset once again provided by the Yahoo Webscope R4. The test dataset had a total of 8358 records, 1675 distinct users T6 Action/ Adventure, Drama, Science Fiction/ Fantasy Comedy(0.25), Romance(0.25), Kids/Family, Animation, Drama(1) Science Fiction/ Fantasy(3.25) Action/ Adventure (5) - - T7 Romance Romance(1.25), Drama(1) Science Fiction/ Fantasy(3.25) Action/ Adventure(5) - - T8 Romance Romance(2.25), Drama(0.75) Science Fiction/ Fantasy(3.25) Action/ Adventure(5) - - T9 Romance Science Fiction/ Fantasy(2.75) Romance(3.25) Action/ Adventure(5) - - T10 Romance Science Fiction/ Fantasy(2) Romance(4.25) - Action/ Adventure(4.75) -
Categorizing User Interests in Recommender Systems 289 and an average of 5 records(4.9899 to be exact)per user. The min test-record for a single user was I while the max was 145. We simply evaluated the test-data from the tables constructed with the training-data. If the"Regular, " New"and"Old"category had interests that could be successfully matched to the interest for the given user in the test-dataset, then we say a hit has been made or else we say it's a miss. The "hit ratio, expressed as a percentage came out as 85.0009%. In other words, in about 85% cases, the user's movie preferences predic- tions matched that computed from the genre preferences as per the given algorithm. To benchmark our algorithm, the closest match we found was in the "Sliding win dow"technique where items that appeared in the last"n"transactions are all consi- dered for the sake of prediction. We have considered the same in our results. We have also had the individual movie genres computed as a percentage of the total number of transactions and used the more frequently occurring genres to predict the forthcoming movies. They are discussed in more details in the results section 6 Results and discussions To test the results, we have benchmarked our algorithm against the"Sliding Win- dow. These lists of genres that occurred in the last"n"transactions have been consi- dered to check against the genres of the movies that have been provided in the test dataset. Following is the result Size of sliding Window Hit ratio as a percentage(rounded 3 53% 68% Thus with a sliding window encompassing last 5 movies, the prediction accuracy was computed as 68%0. We grouped the genres into Very Frequent, Frequent and Least Frequent based on percentage of times they appear in the user's transactions Result follows Serial no Hit Ratio %o Threshold hreshold ercen Percent 50 25 1469% 28.14% 16 43.20% 38.56% 50 10 58.35% We find that our algorithm produces a much more refined results compared to the ones described above
Categorizing User Interests in Recommender Systems 289 and an average of 5 records (4.9899 to be exact) per user. The min test-record for a single user was 1 while the max was 145. We simply evaluated the test-data from the tables constructed with the training-data. If the “Regular”, “New” and “Old” category had interests that could be successfully matched to the interest for the given user in the test-dataset, then we say a hit has been made or else we say it’s a miss. The “hit ratio”, expressed as a percentage came out as 85.0009%. In other words, in about 85% cases, the user’s movie preferences predictions matched that computed from the genre preferences as per the given algorithm. To benchmark our algorithm, the closest match we found was in the “Sliding Window” technique where items that appeared in the last “n” transactions are all considered for the sake of prediction. We have considered the same in our results. We have also had the individual movie genres computed as a percentage of the total number of transactions and used the more frequently occurring genres to predict the forthcoming movies. They are discussed in more details in the results section. 6 Results and Discussions To test the results, we have benchmarked our algorithm against the “Sliding Window”. These lists of genres that occurred in the last “n” transactions have been considered to check against the genres of the movies that have been provided in the test dataset. Following is the result: Size of Sliding Window Hit ratio as a percentage (rounded values) 3 53% 4 61% 5 68% Thus with a sliding window encompassing last 5 movies, the prediction accuracy was computed as 68%. We grouped the genres into Very Frequent, Frequent and Least Frequent based on percentage of times they appear in the user’s transactions. Result follows: Serial No. Very-Frequent Threshold Percent Frequent Threshold Percent Least-Frequent Threshold Percent Hit Ratio % 1 75 50 25 14.69% 2 70 40 15 28.14% 3 66 33 16 43.20% 4 60 35 20 38.56% 5 50 25 10 58.35% We find that our algorithm produces a much more refined results compared to the ones described above
S. Saha et al We observed that users that had at least 20 records in the training set could satis factorily have their interests categorized into the well-defined sets. We computed the attribute cardinality in the sliding window pane of size 5 and the number of attributes in the combined category of"Regular", "Old"and"New"from our algorithm. On overage, our algorithm came up with 4.98 elements while the sliding window had 4.65 elements in its preference set. It might apparently seem that the accuracy of the algorithm has been enhanced by the extra elements compared to the sliding window However, it's interesting to note that an increase in about 6% of the attribute size increases the prediction accuracy by 25%0 We raise a few concerns here as well. We have worked only with a single decom- sable characteristic of the dataset. In most practical situations, we may lose out important dimensions if we work only with a single attribute. Hence, the algorithm needs to be extended to take care of cases where the dataset has multiple reference element-set and non-homogenous attributes. The thresholds for transition and the window size of the activation function and have been experimentally evaluated in this particular example. A generalization tech- e for determining such limiting value is yet to be found out ye have not considered the rating of the movie, the profile of the users- which could have otherwise given some newer findings. If sufficient data is available, one can take into account the user movie watching habit as well that would enable to have different threshold, different window size for different types of users in same dataset Having said all these, the beauty of the algorithm is its simplicity and yet the ro- bustness. The algorithm is computationally not very intensive and requires very little resource in terms of both space and time. In most real world phenomenon, we judge an object by a unique distinguishing attribute only so the algorithm might find tre mendous use with homogenous multi-attribute element for a given dataset. 7 Conclusion In this paper, we propose an approach towards categorizing user interests in a recom- mender system by analyzing users'previous transactions. Every user's interests are categorized into one of the five categories or sets namely sporadic interests, new interests, regular interests, old interests and past interests. The motivation behind our approach is provide a solution for the problem of concept drift. We are more in terested in tracking a user's changing interests rather than going for plain accuracy We would like to extend our algorithm to take into account the issues discussed in section 5 and apply the same on popular research dataset like that of Netflix. The strength of our algorithm is ease of understanding and simplicity in construction. We would look forward to extensions in the similar direction but encompassing much more complexity in our future works. References Herlocker, J, Konstan, J. Borchers, A, Riedl, J: An Algorithm Framework for Perform- ing Collaborative Filtering. In: Proceedings of SIGIR, pp 77-87. ACM, New York(1999) 2. Lam, S, Riedl, J. Shilling Recommender Systems for Fun and Profit. In: Proceedings of the 13th Intermational Www Conference. New York(2004)
290 S. Saha et al. We observed that users that had at least 20 records in the training set could satisfactorily have their interests categorized into the well-defined sets. We computed the attribute cardinality in the sliding window pane of size 5 and the number of attributes in the combined category of “Regular”, “Old” and “New” from our algorithm. On average, our algorithm came up with 4.98 elements while the sliding window had 4.65 elements in its preference set. It might apparently seem that the accuracy of the algorithm has been enhanced by the extra elements compared to the sliding window. However, it’s interesting to note that an increase in about 6% of the attribute size increases the prediction accuracy by 25%. We raise a few concerns here as well. We have worked only with a single decomposable characteristic of the dataset. In most practical situations, we may lose out important dimensions if we work only with a single attribute. Hence, the algorithm needs to be extended to take care of cases where the dataset has multiple reference element-set and non-homogenous attributes. The thresholds for transition and the window size of the activation function and have been experimentally evaluated in this particular example. A generalization technique for determining such limiting value is yet to be found out. We have not considered the rating of the movie, the profile of the users – which could have otherwise given some newer findings. If sufficient data is available, one can take into account the user movie watching habit as well that would enable to have different threshold, different window size for different types of users in same dataset. Having said all these, the beauty of the algorithm is its simplicity and yet the robustness. The algorithm is computationally not very intensive and requires very little resource in terms of both space and time. In most real world phenomenon, we judge an object by a unique distinguishing attribute only so the algorithm might find tremendous use with homogenous multi-attribute element for a given dataset. 7 Conclusion In this paper, we propose an approach towards categorizing user interests in a recommender system by analyzing users’ previous transactions. Every user’s interests are categorized into one of the five categories or sets namely sporadic interests, new interests, regular interests , old interests and past interests. The motivation behind our approach is provide a solution for the problem of concept drift. We are more interested in tracking a user’s changing interests rather than going for plain accuracy. We would like to extend our algorithm to take into account the issues discussed in section 5 and apply the same on popular research dataset like that of Netflix. The strength of our algorithm is ease of understanding and simplicity in construction. We would look forward to extensions in the similar direction but encompassing much more complexity in our future works. References 1. Herlocker, J., Konstan, J., Borchers, A., Riedl, J.: An Algorithm Framework for Performing Collaborative Filtering. In: Proceedings of SIGIR, pp. 77–87. ACM, New York (1999) 2. Lam, S., Riedl, J.: Shilling Recommender Systems for Fun and Profit. In: Proceedings of the 13th International WWW Conference, New York (2004)
Proceedings of the Federated Int. Conf. on the Move to Meaningful Interne吃三 Categorizing User Interests in Recommender Systems 3. Massa, P, Avesani, P. Trust-aware collaborative filtering for recommender syst 4. Massa, P, Avesani, P: Trust-aware recommender systems. In: Proceedings of the 2007 ACM Conference on Recommender Systems, Minneapolis (2007) 5. Massa, P, Avesani, P: Trust-aware Bootstrapping of Recommender Systems. In: Pro- eedings of ECAl Workshop on Recommender Systems, Italy(2006) 6. Massa, P, Avesani, P: Trust metrics on controversial users: balancing between tyranny of the majority and echo chambers. International Journal on Semantic Web and Information Systems 7. Mobasher, B, Burke, R, Bhaumik, R, Williams, C: Towards Trustworthy Recommender Systems: An Analysis of Attack Models and Algorithm Robustness. ACM Transactions on Internet Technology 7, 23: 1-23: 38(2007) 8. Rashid, A, Karypis, G, Riedl, J: Learning Preferences of New Users in Recommender Systems: An Information Theoretic Approach. SIGKDD Explorations 10, 90-100(2008) 9. Sarwar, B, Karypis, G, Konstan, J, Riedl, J: Item-based Collaborative Filtering of Rec mmendation Algorithms. In: Proceedings of the 1Oth International Www Conference. 10. Victor, P, Cornelis, C, Cock, M, Teredesai, A: Key figure impact in trust-enhanced re- commander systems. Al Communications 21(2-3), 127-143(2008)
Categorizing User Interests in Recommender Systems 291 3. Massa, P., Avesani, P.: Trust-aware collaborative filtering for recommender systems. In: Proceedings of the Federated Int. Conf. on the Move to Meaningful Internet (2004) 4. Massa, P., Avesani, P.: Trust-aware recommender systems. In: Proceedings of the 2007 ACM Conference on Recommender Systems, Minneapolis (2007) 5. Massa, P., Avesani, P.: Trust-aware Bootstrapping of Recommender Systems. In: Proceeedings of ECAI Workshop on Recommender Systems, Italy (2006) 6. Massa, P., Avesani, P.: Trust metrics on controversial users: balancing between tyranny of the majority and echo chambers. International Journal on Semantic Web and Information Systems 7. Mobasher, B., Burke, R., Bhaumik, R., Williams, C.: Towards Trustworthy Recommender Systems: An Analysis of Attack Models and Algorithm Robustness. ACM Transactions on Internet Technology 7, 23:1–23:38 (2007) 8. Rashid, A., Karypis, G., Riedl, J.: Learning Preferences of New Users in Recommender Systems: An Information Theoretic Approach. SIGKDD Explorations 10, 90–100 (2008) 9. Sarwar, B., Karypis, G., Konstan, J., Riedl, J.: Item-based Collaborative Filtering of Recommendation Algorithms. In: Proceedings of the 10th International WWW Conference, Hong Kong (2001) 10. Victor, P., Cornelis, C., Cock, M., Teredesai, A.: Key figure impact in trust-enhanced recommender systems. AI Communications 21(2-3), 127–143 (2008)