Incorporating User Control into Recommender Systems Based on Naive Bayesian Classification

Verus Pronk, Philips Research Europe, High Tech Campus 34, 5656 AE Eindhoven, The Netherlands, verus.pronk@philips.com
Wim Verhaegh, Philips Research Europe, High Tech Campus 11, 5656 AE Eindhoven, The Netherlands, wim.verhaegh@philips.com
Adolf Proidl, Philips Research Europe, Gutheil-Schoder-Gasse 10, A-1100 Vienna, Austria, adolf.proidl@philips.com
Marco Tiemann, Philips Research Europe, High Tech Campus 34, 5656 AE Eindhoven, The Netherlands, marco.tiemann@philips.com

ABSTRACT

Recommender systems are increasingly being employed to personalize services, such as on the web, but also in electronics devices, such as personal video recorders. These recommenders learn a user profile, based on rating feedback from the user on, e.g., books, songs, or TV programs, and use machine learning techniques to infer the ratings of new items.

The techniques commonly used are collaborative filtering and naive Bayesian classification, and they are known to have several problems, in particular the cold-start problem and its slow adaptivity to changing user preferences. These problems can be mitigated by allowing the user to set up or manipulate his profile.

In this paper, we propose an extension to the naive Bayesian classifier that enhances user control. We do this by maintaining and flexibly integrating two profiles for a user, one learned by rating feedback, and one created by the user. We in particular show how the cold-start problem is mitigated.
Categories and Subject Descriptors

G.3 [Probability and Statistics]: Probabilistic algorithms (including Monte Carlo); H.3.3 [Information Search and Retrieval]: Information filtering

General Terms

Algorithms

Keywords

classification, machine learning, naive Bayes, recommender, user profile, user control, multi-valued features

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
RecSys'07, October 19–20, 2007, Minneapolis, Minnesota, USA.
Copyright 2007 ACM 978-1-59593-730-8/07/0010 $5.00.

1. INTRODUCTION

The use of recommender technology is steadily being introduced into the market. Web sites such as Yahoo! Movies and MovieLens offer a recommender to aid users in finding the movies they like, and personal video recorders like the Tivo box use a recommender for automatic recording of the user's favorite movies.

A recommender learns the taste of a user, based on ratings that the user supplies on items, such as movies. These ratings are typically a positive-negative classification, indicating like and dislike, respectively, or a more elaborate classification into a range of like-degrees. As such, this is the interface by which the user teaches the recommender about his taste. This learning process is inherently slow in the sense that the user has to rate a considerable number of items before the recommender can make sensible suggestions. This leads to the well-known cold-start problem and to a slow adaptability of the recommender to changing user preferences. The former problem pertains to the situation that the user has only rated a few items.
Although the cold-start problem can be mitigated by setting up a rating session wherein the user rates a sufficient number of items, the burden this places upon the user is generally considered unacceptable. Furthermore, this is not an elegant way to deal with a sudden change of interest. For example, if the user discovers that he likes a particular actor or director, having to rate many movies before the recommender has learned this clearly indicates that this rating interface is inappropriate to deal with such situations. The user should have more direct control over the recommender to be able to resolve these situations.

Naive Bayesian classification (NBC) lends itself well to this type of user control because often, the profile it uses is quite intuitive. A movie is described by a number of feature-value pairs, such as (channel, Hallmark Movie Channel), (actor, Clint Eastwood), or (start time, 20:00), and the profile consists of positive and negative counts for individual feature-value pairs. From these figures, like-degrees can be computed, at the level of feature-value pairs rather than at the level of individual movies. The user can similarly create a profile by defining like-degrees for feature-value pairs, so that these profiles can be combined.

We propose to extend NBC by incorporating a user-defined profile and flexibly combining it with the learned profile. This flexibility is such that, initially, the recommender operates based solely on
the user-defined profile and, as the recommender learns, the learned profile gradually replaces the user-defined profile. However, when the user adapts his profile, the adapted part temporarily takes over again.

The remainder of this paper is organized as follows. After briefly describing related work in Section 2, we revisit naive Bayesian classification in Section 3. Then, in Section 4, we explain how to incorporate a user-defined profile into the classifier. In Section 5 we generalize this to multi-valued features, where a feature can attain a number of values simultaneously. We report on performance results, providing a proof of concept for the recommender, in Section 6. We make some concluding remarks in Section 7.

2. RELATED WORK

Several approaches have been proposed in the literature to mitigate the cold-start problem. In [6], the authors propose to use precomputed stereotypes from which a user can choose some to jump-start a recommender. A stereotype is a possibly large set of items, TV programs in their paper, that are similar to each other, where similarity is defined using the modified value-difference metric (see [2], [10]). The precomputation uses rating histories of a number of users and uses a standard clustering algorithm.

An alternative approach is to combine a user-defined recommender with a learning recommender by combining their outputs. These are called hybrid recommender systems. In [12], the authors propose to use a neural network for fusing the outputs. User control beyond incorporating a user-defined recommender is, however, lacking.

In [9], the author suggests an alternative to MovieLens, called DynamicLens, to aid the user in providing user-defined ratings to a meta-recommender system. The interface is in particular geared towards enhancing user control in hybrid recommender systems.
The aim of this paper is to integrate a user-defined profile and a learned profile into a single recommender, rather than using multiple recommenders, offering direct user control over recommended items.

3. NAIVE BAYESIAN CLASSIFICATION

We next describe NBC in detail, starting with some notation. An instance x is described by f feature values xi ∈ Di, for each i = 1, 2, ..., f, where Di is the domain of feature i. Its class is denoted by c(x) ∈ C, where C is the set of classes. For the moment, we do not consider missing features, but return to this shortly.

Given is a non-empty set X of training instances and for each instance x ∈ X its class cx = c(x). Let y be an instance to be classified. The approach in NBC is that we express Pr(c(y) = j), for each j ∈ C, in terms of the training data.

Let x be a random variable on the domain U of instances. Using Bayes' rule and assuming conditional independence of feature values for a given class, we can rephrase Pr(c(y) = j) as follows.

Pr(c(y) = j) = Pr(c(x) = j | x = y)
= Pr(c(x) = j) Pr(x = y | c(x) = j) / Pr(x = y)
= Pr(c(x) = j) ∏_{i=1}^{f} Pr(xi = yi | c(x) = j) / Pr(x = y). (1)

As the denominator can alternatively be written as the sum over all j of the numerator, it serves as a normalization constant. When comparing probabilities, this constant can be omitted.

The factors Pr(c(x) = j) are called prior probabilities, the factors Pr(xi = yi | c(x) = j) conditional probabilities, and the expressions Pr(c(y) = j) are called the posterior probabilities.

The general approach in NBC is that the prior and conditional probabilities are estimated using the training data to obtain estimates of the posterior probabilities. We define the learned profile as follows. For each feature i, each value v ∈ Di, and each class j,

N(j) = |{x ∈ X | cx = j}| and
N(i, v, j) = |{x ∈ X | xi = v ∧ cx = j}|, (2)

where |S| denotes the cardinality of a set S.
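In code, the learned profile of Equation 2 amounts to simple counting. Below is a minimal Python sketch; the feature names and rating history are invented purely for illustration:

```python
from collections import Counter

# Hypothetical toy rating history: a feature dict plus a "+"/"-" class label.
X = [
    ({"channel": "Hallmark", "genre": "drama"}, "+"),
    ({"channel": "Hallmark", "genre": "comedy"}, "+"),
    ({"channel": "CNN", "genre": "news"}, "-"),
]

# N(j) of Equation 2: the number of training instances of class j.
N_class = Counter(c for _, c in X)

# N(i, v, j) of Equation 2: class-j instances whose feature i has value v.
N_fvc = Counter((i, v, c) for x, c in X for i, v in x.items())

print(N_class["+"], N_fvc[("channel", "Hallmark", "+")])  # 2 2
```

Maintaining the profile as counters also makes incremental updates trivial: a new rating only increments the counts it touches.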
By assuming, without loss of generality, that for each j, N(j) > 0, we estimate the probabilities as

Pr(c(x) = j) ≈ N(j)/|X| and (3)
Pr(xi = yi | c(x) = j) ≈ N(i, yi, j) / N(j). (4)

By substituting these estimates into Equation 1 we obtain an estimate of the probability that y belongs to class j in terms of the training data.

In case N(i, yi, j) = 0 in Equation 4, a technique called Laplace correction, see [1], is used to prevent the conditional probability estimate and corresponding posterior probability estimate from becoming 0. For now, we will assume that we do not have any so-called zero counts.

The naive Bayes' classification c̃(y) of y is defined as the value of j that maximizes the estimate. Ties are broken arbitrarily. Formally, c̃(y) is defined as

c̃(y) = arg max_{j∈C} (N(j)/|X|) ∏_{i=1}^{f} N(i, yi, j)/N(j). (5)

If c̃(y) ≠ c(y), then we speak of a classification error. The classification error rate E, or error rate for short, is defined as

E = Pr(c̃(x) ≠ c(x)), (6)

and is a measure for the performance of the classifier. Here, x is again a randomly chosen instance. The classification accuracy is defined as 1 − E. The definition of error rate can be refined by considering class-conditional error rates. Given a class j, we define

Ej = Pr(c̃(x) ≠ c(x) | c(x) = j) (7)

as the class-j error rate. The class-conditional classification accuracy is given by 1 − Ej.

This summarizes the classical approach towards naive Bayesian classification, see also [7]. Some remarks, however, are worth making. First, Equations 3 and 4 can indeed be used as estimates of the prior and conditional probabilities, respectively, if the training set consists of instances chosen randomly on the instance space U. In practice, this may not be the case.
For example, if the prior probabilities are heavily skewed, then this would require a relatively large training set to obtain a sufficient number of class-j instances in the training set, for those values of j for which the prior probabilities are relatively small, to obtain sufficiently reliable estimates for the conditional probabilities. It seems therefore reasonable to keep the values of N(j) approximately equal. In this case, however, Equation 3 does not generally hold, and proper values for the prior probabilities should be obtained in a different way. Also in other cases, such as in a recommender system, where it may be largely up to the user which instances end up in the training set, Equation 3 may be invalid.
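Stepping back, Equations 2 to 5 amount to a classifier of only a few lines. The sketch below uses invented toy data, and simple add-one smoothing as a stand-in for the Laplace correction of [1], whose exact form may differ:

```python
from collections import Counter

# Hypothetical toy training set: a feature dict plus a "+"/"-" class label.
X = [
    ({"channel": "Hallmark", "genre": "drama"}, "+"),
    ({"channel": "Hallmark", "genre": "comedy"}, "+"),
    ({"channel": "CNN", "genre": "news"}, "-"),
    ({"channel": "CNN", "genre": "drama"}, "-"),
]
classes = ["+", "-"]
N_class = Counter(c for _, c in X)                              # N(j)
N_fvc = Counter((i, v, c) for x, c in X for i, v in x.items())  # N(i, v, j)

def classify(y):
    """Equation 5: argmax over j of N(j)/|X| * prod_i N(i, y_i, j)/N(j),
    with add-one smoothing standing in for the Laplace correction of [1]."""
    best, best_score = None, -1.0
    for j in classes:
        score = N_class[j] / len(X)                 # prior estimate, Equation 3
        for i, v in y.items():                      # conditional, Equation 4, smoothed
            score *= (N_fvc[(i, v, j)] + 1) / (N_class[j] + 2)
        if score > best_score:
            best, best_score = j, score
    return best

print(classify({"channel": "Hallmark", "genre": "drama"}))  # +
```

Since only the argmax matters, the normalization constant of Equation 1 is omitted, exactly as the text describes.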
For a two-class recommender, [5] suggests to set the prior probabilities so as to balance the class-conditional error rates, an approach that is essentially the same as that suggested in [4], where the author uses a cost-based approach towards classification. We adopt the same approach, in that the priors are set to predefined values pj for each j ∈ C, based on criteria related to the class-conditional error rates.

The second remark pertains to missing features and features with multiple values. We will return to the latter in Section 5. Whenever a feature is missing in an instance to be classified, we omit the corresponding factor from Equation 5. To deal with missing features in instances from the training set, we adapt the estimate of the conditional probability, given in Equation 4, to

Pr(xi = yi | c(x) = j) ≈ N(i, yi, j) / N(i, j),

where N(i, j) = ∑_{v∈Di} N(i, v, j). Thus, N(i, j) counts for each j the number of instances in the training set that do have a value for feature i. The definition of c̃ in Equation 5 is thus replaced by

c̃(y) = arg max_{j∈C} pj ∏_{i=1}^{f} N(i, yi, j)/N(i, j). (8)

4. INCORPORATING A USER-DEFINED PROFILE

The goal in this section is to integrate a user-defined profile with the learned profile, given by Equation 2. We do this by defining like-degrees for individual feature values.

We assume that there are two classes, i.e., positive (+) and negative (−). We also simplify notation by omitting, where possible, explicit reference to the random variable x. In particular, we abbreviate 'c(x) = j' to 'j' and 'x = y' to 'y'. Using Equation 1 and predefined prior probabilities p+ and p−, we derive that

Pr(+ | y) / Pr(− | y) = (p+/p−) ∏_{i=1}^{f} Pr(xi = yi | +) / Pr(xi = yi | −). (9)

Corresponding to the definition of c̃ in Equation 8, if the estimate of the right-hand side of this equation is larger than 1, y is classified as positive, if it is smaller than 1, y is classified as negative, and if it equals 1, a random choice is made.
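The modified classifier of Equation 8 can be sketched in the same style. The data below are again hypothetical; a missing feature is simply absent from an instance's dict, so its factor is omitted, and zero counts are left unsmoothed for brevity:

```python
from collections import Counter

# Hypothetical toy data; a missing feature is simply absent from the dict.
X = [
    ({"channel": "Hallmark", "genre": "drama"}, "+"),
    ({"genre": "comedy"}, "+"),                       # channel missing
    ({"channel": "CNN", "genre": "news"}, "-"),
    ({"channel": "CNN"}, "-"),                        # genre missing
]
classes = ["+", "-"]
priors = {"+": 0.5, "-": 0.5}  # predefined p_j, not estimated from X

N_fvc = Counter((i, v, c) for x, c in X for i, v in x.items())
# N(i, j) = sum_v N(i, v, j): class-j instances that do have feature i.
N_fc = Counter((i, c) for x, c in X for i in x)

def classify(y):
    """Equation 8: argmax over j of p_j * prod_i N(i, y_i, j)/N(i, j),
    where features missing from y contribute no factor at all."""
    best, best_score = None, -1.0
    for j in classes:
        score = priors[j]
        for i, v in y.items():
            score *= N_fvc[(i, v, j)] / N_fc[(i, j)]
        if score > best_score:
            best, best_score = j, score
    return best

print(classify({"channel": "CNN"}))  # -
```

Note how the denominator N(i, j) only counts instances that actually carry feature i, matching the adapted conditional-probability estimate above.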
We next define

r(i, v) = Pr(xi = v | +) / Pr(xi = v | −) (10)
= (Pr(+ | xi = v) / Pr(− | xi = v)) / (Pr(+) / Pr(−)) (11)

as the skewing factor for feature-value pair (i, v). As shown by Equation 11, it indicates the relative skew that v causes for feature i in the prior probabilities. Using Equation 10, Equation 9 can be rephrased as

Pr(+ | y) / Pr(− | y) = (p+/p−) ∏_{i=1}^{f} r(i, yi). (12)

The involved skewing factors are thus multiplied together. Note that a non-zero skewing factor r is canceled by its inverse 1/r. Where r(i, v) ranges from 0 to ∞, the derived quantity l(i, v), defined as

l(i, v) = r(i, v) / (1 + r(i, v)), (13)

ranges from 0 to 1, 1 itself excluded. This l(i, v) can be interpreted as a like-degree for feature-value pair (i, v), where 0.5 corresponds to neutral, as it leads to the neutral skewing factor of 1. Furthermore, two non-zero like-degrees l and 1 − l lead to the skewing factors l/(1 − l) and (1 − l)/l, respectively, which cancel each other when multiplied together.

This forms the basis for the incorporation of a user-defined profile l^u. On the one hand, the like-degree l^l(i, v) is learned from the training set using the learned skewing factor r^l(i, v), defined as

r^l(i, v) = (N(i, v, +)/N(i, +)) / (N(i, v, −)/N(i, −)), (14)

which estimates the skewing factor given by Equation 10. On the other hand, for each feature-value pair (i, v), there is a user-defined like-degree l^u(i, v). Its default value is 0.5, i.e., neutral, but the user can set this like-degree to any value in the range [0, 1). Using this default value enables the user to easily create a profile by only setting a few like-degrees.

The user-defined and learned like-degrees are integrated as follows. We define the integrated like-degree l^int(i, v) as

l^int(i, v) = α l^u(i, v) + (1 − α) l^l(i, v), (15)

with α ∈ [0, 1]. This l^int(i, v) is next used to calculate its corresponding integrated skewing factor r^int(i, v) using the inverse function x/(1 − x) of x/(1 + x), which is used in Equation 13.
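The maps between skewing factors and like-degrees, and the blend of Equation 15, involve no assumptions beyond the formulas above and can be sketched directly:

```python
def like_degree(r):
    """Equation 13: map a skewing factor r in [0, inf) to a like-degree in [0, 1)."""
    return r / (1.0 + r)

def skewing_factor(l):
    """Inverse of Equation 13: map a like-degree l in [0, 1) back to r = l/(1 - l)."""
    return l / (1.0 - l)

def integrated_skewing_factor(l_user, l_learned, alpha):
    """Equation 15 followed by the inverse map: blend the user-defined and
    learned like-degrees with weight alpha, then return the integrated factor."""
    return skewing_factor(alpha * l_user + (1.0 - alpha) * l_learned)

# A neutral like-degree of 0.5 yields the neutral skewing factor 1.
assert skewing_factor(0.5) == 1.0
# Complementary like-degrees l and 1 - l yield factors that cancel.
print(round(skewing_factor(0.8) * skewing_factor(0.2), 10))  # 1.0
# With alpha = 1 the user-defined like-degree alone determines the result.
print(round(integrated_skewing_factor(0.9, 0.5, 1.0), 10))  # 9.0
```

The weight alpha is deliberately left as a parameter here; the text next ties it to the training-set size.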
This skewing factor replaces r(i, v) in Equation 12.

To generate a recommender that starts off as one based on a user-defined profile and gradually turns into a learning recommender based only on the learned profile, this α should preferably be made dependent on the size of the training set. The thick, solid line in Figure 1 illustrates a possible definition, where the horizontal axis denotes the training set size.

Figure 1: Possible dependency of α on the training set size S (thick, solid line) and a possible trajectory of α when an update at S0 in the user-defined profile and two pruning actions are performed (thin, grey lines).

Initially, that is, until the training set has reached a specific size K, the user-defined profile should be the only active one, to allow the learned profile to mature somewhat. Otherwise, the user-defined profile would immediately be contaminated with unreliable data. Then, the learned profile gradually takes over as the training set size increases, until it is sufficiently large, indicated by the limit L, at which point the learned profile has completely taken over. There is, of course, ample freedom in choosing how α depends on S. The linear relationship is chosen here for simplicity.

This can be refined by making α dependent on i, the feature under consideration in Equation 15. In particular, Nmin(i), defined as

Nmin(i) = min(N(i,−), N(i,+)), (16)
provides a measure for the size of the training data pertaining to feature i. Using this feature-dependent measure is especially relevant for features that are often not assigned a value, resulting in relatively unreliable estimates of the conditional probabilities. Hence, we propose to define αi as follows:

αi = min{1, (L − Nmin(i))+ / (L − K)},  (17)

where, for an expression E, (E)+ stands for max(0, E). This definition corresponds to the thick, solid line in Figure 1, whereby S = Nmin(i).

4.1 On the user regaining control

As the training set size increases, the role of the user-defined profile generally decreases. Hence, if the user updates his profile, say the like-degree of feature-value pair (i, v), while the learned profile has nearly or completely taken over, then this has little or no effect. A possible solution to this problem is to prune the rating history such that Nmin(i) decreases, but this is a rather drastic approach. It is more elegant to incorporate an offset D(i, v) by which Nmin(i) is decreased. Instead of using αi, we use αiv, defined as

αiv = min{1, (L − (Nmin(i) − D(i, v)))+ / (L − K)}.

Now, when the user updates the like-degree of feature-value pair (i, v), the corresponding offset is set to, e.g., Nmin(i) or Nmin(i) − K, so that, for this pair, the recommender starts off using only the user-defined profile again. Once αiv has become 0 again, D(i, v) can be reset to 0 as well. See Figure 1 for a possible trajectory of αiv.

4.2 Pruning the rating history

To retain flexibility in the learned profile to adapt to user preferences that change slowly over time, the rating history should regularly be pruned, e.g., by disregarding the oldest ratings. This is complicated by the presence of non-zero offsets D(i, v). There are various ways to deal with this, but an obvious and simple solution is to try and keep αiv constant, until the size of the training set becomes so small that an increase in αiv is necessary.
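As an illustration, this schedule can be sketched in a few lines of Python. This is our own sketch, not the authors' implementation; the defaults K = 10 and L = 50 are the values later used in the simulations of Table 1.

```python
def mixing_weight(n_min, K=10, L=50, offset=0.0):
    """alpha_iv: the weight of the user-defined profile for feature i,
    value v (Equation 17, with the offset D(i, v) of Section 4.1).

    n_min  -- Nmin(i) = min(N(i,-), N(i,+)), the per-feature training size
    offset -- D(i, v); with offset 0 this is the plain alpha_i of Eq. 17
    """
    s = n_min - offset                        # effective training size
    # (L - s)+ / (L - K), clipped from above at 1
    return min(1.0, max(0.0, L - s) / (L - K))
```

With these defaults the weight stays at 1 until Nmin(i) exceeds K = 10, then decays linearly, reaching 0 at L = 50; a non-zero offset shifts the effective training size back, restoring user control for that pair.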
This results in the following update for D(i, v) when Nmin(i) changes from N to N′ < N:

D(i, v) := (D(i, v) − N + N′)+.

The thin, grey lines in Figure 1 illustrate a possible trajectory, indicated by the arrowheads, of the value of α after an update in the user-defined profile at S = S0 and two pruning actions.

5. MULTI-VALUED FEATURES

It is not uncommon that features have multiple values, such as the cast of a movie or a list of genres describing a movie in general terms. A possible approach is to consider a list of values as independent and to incorporate them as such in the computation of the posterior probabilities. In particular, if a feature i has a set of values V, then all feature-value pairs (i, v), with v ∈ V, serve as separate features, each contributing a factor in the product in Equation 8. Especially when the values in V are dependent, such as with a fixed cast, this approach results in an over-representation of this feature.

We provide an alternative to this approach, based on [11], where the author considers probabilistic feature values: the value of a feature i is given by a probability distribution on its domain Di. Let y be an instance with probabilistic feature values, i.e., for each i = 1, 2, ..., f, yi is randomly distributed on Di according to probability distribution py(i, v). Let Vy(i) ⊆ Di denote the set of possible values of yi. In the case that |Vy(i)| = 1, yi degenerates to a deterministic value. Let the set Vy = {v | vi ∈ Vy(i), i = 1, 2, ..., f} denote the set of values that y can attain. We assume that, for each v ∈ Vy,

Pr(y = v) = ∏_{i=1}^{f} py(i, vi).

This independence assumption is similar to the usual conditional independence assumption. We are again interested in Pr(c(y) = j) for each class j. Let x be a random instance, uniformly distributed on the instance space. We proceed along the same lines as in Section 3 and partially follow Störr [11].
Note that we use shorthand notation, i.e., we abbreviate 'c(x) = j' to 'j', 'x = v' to 'v', and 'x = y' to 'y' in the probability function.

Pr(c(y) = j) = ∑_{v∈Vy} Pr(c(y) = j | y = v) Pr(y = v)
            = ∑_{v∈Vy} Pr(c(v) = j) Pr(y = v)
            = ∑_{v∈Vy} Pr(j | v) Pr(y = v)
            = ∑_{v∈Vy} [Pr(v | j) Pr(j) / Pr(v)] Pr(y = v)
            = Pr(j) ∑_{v∈Vy} ∏_{i=1}^{f} [Pr(xi = vi | j) py(i, vi)] / Pr(y)
            = Pr(j) ∏_{i=1}^{f} [∑_{v∈Vy(i)} Pr(xi = v | j) py(i, v)] / Pr(y).  (18)

Note that Pr(v) = Pr(y), as x is uniformly distributed on the instance space. As before, the right-hand side of Equation 18 can be estimated using the training data, which may also contain instances with probabilistic feature values. To this end, we generalize the definition of the user profile N(i, v, j) as follows:

N(i, v, j) = ∑_{x∈Xj} px(i, v),  (19)

where Xj = {x ∈ X | cx = j}, i.e., the set of class-j training instances. For an instance y, the classification c̃(y) is generalized to

c̃(y) = argmax_{j∈C} pj ∏_{i=1}^{f} ∑_{v∈Vy(i)} [N(i, v, j) / N(i, j)] py(i, v).

Multi-valued features are modeled as uniformly distributed probabilistic features, i.e., for each i, all values v ∈ Vy(i) are equally probable: py(i, v) = 1/|Vy(i)|. In this case, all parameters py(i, v) can be omitted from the formula above, as is easily verified. Equation 19 can in this case be interpreted as if each instance with a multi-valued feature i, say with m values, is subdivided into m sub-instances, one corresponding to each value, each of which is counted 1/m times.
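To make the generalization concrete, the soft-count profile of Equation 19 and the generalized classification c̃(y) can be sketched in Python as follows. The data structures and names are our own assumptions (an instance maps each feature to a distribution over its values), and Laplace correction is omitted for brevity.

```python
from collections import defaultdict

def update_profile(N, x, j):
    """Add instance x of class j to the profile (Equation 19).
    x maps feature i -> {value v: p_x(i, v)}; a multi-valued feature
    with m values gets the uniform distribution 1/m over its values."""
    for i, dist in x.items():
        for v, p in dist.items():
            N[(i, v, j)] += p             # soft count N(i, v, j)
        N[(i, j)] += sum(dist.values())   # N(i, j) = sum_v N(i, v, j)

def classify(N, priors, y, classes=('-', '+')):
    """c~(y): argmax over j of
    p_j * prod_i sum_{v in Vy(i)} (N(i, v, j) / N(i, j)) * p_y(i, v)."""
    best, best_score = None, -1.0
    for j in classes:
        score = priors[j]
        for i, dist in y.items():
            if N[(i, j)] == 0:
                continue                  # no data for this feature/class
            score *= sum(N[(i, v, j)] / N[(i, j)] * p
                         for v, p in dist.items())
        if score > best_score:
            best, best_score = j, score
    return best
```

A deterministic feature value is simply the degenerate distribution {v: 1.0}, matching the |Vy(i)| = 1 case above.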
Figure 2: Graphical views of the integration steps: (a) without and (b) with multi-valued features.

Now that we have generalized naive Bayesian classification to deal with multi-valued features, we next explain how to incorporate a user-defined profile. Analogously to Equation 9, we derive, based on Equation 18,

Pr(+ | y) / Pr(− | y) = (p+ / p−) ∏_{i=1}^{f} [∑_{v∈Vy(i)} Pr(xi = v | +) / ∑_{v∈Vy(i)} Pr(xi = v | −)].

Using Equation 10, we can rephrase this as

Pr(+ | y) / Pr(− | y) = (p+ / p−) ∏_{i=1}^{f} [∑_{v∈Vy(i)} Pr(xi = v | −) r(i, v) / ∑_{v∈Vy(i)} Pr(xi = v | −)],  (20)

from which it is easily seen that, for each feature i, a weighted average of the involved skewing factors is calculated, the weights depending on the conditional probabilities of the involved values v. As these weights are not available in the user-defined profile, we suggest replacing the weights by 1/|Vy(i)| to calculate the ordinary average.

Both average skewing factors must next be translated to like-degrees to take a convex combination of these like-degrees. The final issue to resolve is how to take a convex combination as in Equation 15, where the like-degrees have been replaced by the translation of an average of skewing factors. An obvious solution is to use the maximum of αiv over v ∈ Vy(i), as this results in a contribution from the user-defined profile corresponding to a value for which the contribution is maximal.
In this way, whenever the like-degree for a value is updated by the user, its role will maximally permeate the results. The procedure is illustrated in Figure 2b. For the learning part of the recommender, we estimate the weighted average of the learned skewing factors in Equation 20 using the training data, resulting in a single learned skewing factor, which is translated to a learned like-degree. For the user-defined part, we take the ordinary average of the involved user-defined skewing factors, based on the user-defined like-degrees, resulting in a user-defined skewing factor, which is translated back to a user-defined like-degree. Then, these two like-degrees are combined as in Equation 15, but with an alternative value α′ for α as explained above, and translated back to an integrated skewing factor.

Figure 3: Subdivision of part of the rating history of a user into a user-defined-profile, training, and test part.

6. PERFORMANCE RESULTS

This section consists of three parts. We describe the simulation setup in the first part. In the second, we report on initial results while experimenting with an actual implementation, resulting in a number of improvements. In the third part, we provide a proof of concept by using a specially prepared user-defined profile.

6.1 Simulation setup

For a thorough assessment of the performance of the recommender, it is necessary to integrate the recommender into an application or system, such as a personal video recorder, to involve users, and to define appropriate performance measures. In particular, when the recommender is used to guide the automatic recording of broadcast movies on a hard disk, the total average value to the user of the content on disk would be an appropriate measure. Another issue that is of particular relevance in the current context is user satisfaction with the ability to control the recommender.
A user may initially set and tune a simple profile, based on immediate feedback in terms of recommended items. We currently circumvent these complex assessments and instead focus on a proof of concept, wherein we show that, given an appropriately chosen user-defined profile, the proposed solution indeed mitigates the cold-start problem. For reasons of space, we do not investigate sudden changes of interest by users, so that the offsets D(i, v) play no role and the αi defined in Equation 17 can be used, which is independent of the feature values.

We consider the rating histories of seven users A, ..., G that we collected. For each user, we use part of his rating history to create an appropriate user-defined profile; how this is done is explained shortly. We use another part of the rating history to train the recommender and to obtain values for the prior probabilities. We use yet another part to test it. Figure 3 and Table 1 illustrate the details. The complete rating history, of size Shi, is shown as a line that loosely corresponds to a time line. It contains both positively and negatively rated programs. On this line are indicated the data used for the user-defined profile, Sud in size, the training set of size Str, and the test set of size Ste. The variable S indicates the starting position of the three adjacent parts, and this variable runs over the rating history to collect sufficient statistics.

Referring to the table, the parameters K and L pertain to the definition of αi in Equation 17. They were chosen based on some experience with the recommender. Note that a value of 50 for L implies that a rating history of at least 100 programs is required for the recommender to become completely based on the learned profile. The parameter d is used in the construction of the user-defined profile, and is discussed below.
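The subdivision of the rating history and the sweep of S can be sketched as follows. This is our own Python illustration of the setup just described; the enumeration of starting positions uses S = j·Sud and the bound on jmax given below.

```python
def split_history(history, S, S_ud, S_tr, S_te):
    """Cut three adjacent parts out of a rating history, starting at
    position S: the user-defined-profile part, the training set, and
    the test set (cf. Figure 3)."""
    assert S + S_ud + S_tr + S_te <= len(history)
    a, b, c = S + S_ud, S + S_ud + S_tr, S + S_ud + S_tr + S_te
    return history[S:a], history[a:b], history[b:c]

def start_positions(S_hi, S_ud, S_tr, S_te):
    """Starting positions S = j * S_ud for j = 0, ..., jmax, where jmax
    is the largest integer with (jmax + 1) * S_ud + S_tr + S_te <= S_hi."""
    positions, j = [], 0
    while (j + 1) * S_ud + S_tr + S_te <= S_hi:
        positions.append(j * S_ud)
        j += 1
    return positions
```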
For each value of Str, which runs from 0 to 100 with a step size of 5, the variable j, which controls S, runs from 0 to jmax, where jmax is the largest integer satisfying the inequality (jmax + 1)Sud + Str + Ste ≤ Shi. This choice causes
almost the entire rating history to be used for each value of Str, each time with an independent user-defined profile. These jmax + 1 results are averaged to compute an error rate.

Table 1: Parameter settings for the simulations.

  parameter   value(s)
  K           10
  L           50
  d           7
  Sud         50
  Str         0 (5) 100
  Ste         100
  S           j · Sud

The user-defined profile is constructed as follows. Using the designated history part, skewing factors are defined for all feature-value pairs, according to Equation 14, using Laplace correction if necessary. These are next converted to like-degrees and rounded to the nearest of a limited number of like-degrees lk, k = 1, 2, ..., d, defined as

lk = (k − 0.5)/d.

A value of 7 for d seems reasonable. Hence, we mimic a user who has meticulously defined a profile that closely matches a learned profile based on 50 ratings, but is restricted to only a limited number of equidistant like-degrees. Although this is not realistic in practice, it does provide proper input for a proof of concept.

As already mentioned, the training set is also used to generate values for the prior probabilities. This is done as follows. Using the leave-one-out cross-validation method, see [7], positive posterior probability estimates are calculated for all training instances, using uniform priors. Next, a decision threshold in the interval (0, 1) is determined that results in a minimal difference between the positive and negative classification error rates on the classified instances. The decision threshold determines which instances are classified as positive and which ones as negative, based on a comparison with the positive posterior probability estimates. The decision threshold can be translated to prior probabilities. In particular, it can be shown that, if uniform priors are used initially, then for a decision threshold γ, the positive prior probability should be set to 1 − γ. Once the priors have been set, the arg-max operator can be used, effectively resulting in a decision threshold of 0.5.
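The threshold-to-priors step can be sketched as follows. This is our own Python illustration; in particular, the grid search over candidate thresholds is an assumption, since the paper does not specify how γ is determined.

```python
def priors_from_threshold(posteriors, labels, grid=99):
    """Translate leave-one-out positive-posterior estimates (computed
    with uniform priors) into class priors: search for the threshold
    gamma in (0, 1) that minimizes the difference between the positive
    and negative classification error rates, then set p+ = 1 - gamma."""
    pos = [p for p, l in zip(posteriors, labels) if l == '+']
    neg = [p for p, l in zip(posteriors, labels) if l == '-']
    if not pos or not neg:               # no training data in one class:
        return 0.5, 0.5                  # fall back to uniform priors
    best_gamma, best_gap = 0.5, float('inf')
    for k in range(1, grid + 1):
        gamma = k / (grid + 1)           # candidate thresholds in (0, 1)
        err_pos = sum(p < gamma for p in pos) / len(pos)    # '+' classified '-'
        err_neg = sum(p >= gamma for p in neg) / len(neg)   # '-' classified '+'
        if abs(err_pos - err_neg) < best_gap:
            best_gamma, best_gap = gamma, abs(err_pos - err_neg)
    return 1.0 - best_gamma, best_gamma  # (p+, p-)
```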
6.2 Initial results

In practice, a user will generally not set a large number of like-degrees, so that the user-defined profile will contain many neutral values of 0.5, and thus skewing factors equal to 1. These neutral values have a detrimental effect on the classification accuracy. To mitigate this, the approach we take is that the ordinary average over the user-defined skewing factors is taken only over those unequal to 1. If there are no such skewing factors, the average is set to 1.

This change correspondingly influences the definition of α′ in the general case that the offsets may be unequal to 0. The maximum is only taken over those values whose user-defined skewing factors are unequal to 1. If there are no such skewing factors, all α-values for the multi-valued feature are the same, and this value should be chosen. For the latter, it is required that the offset D(i, v) be reset to 0 whenever the user resets the corresponding like-degree to neutral. This, however, is of no concern for the current simulations.

Although it is tempting not to use small training sets to generate the priors, this has a detrimental effect on the classification accuracy. Hence, unless one of the classes has no training data, in which case the priors are set to 0.5, the training set is used to generate the priors.

The sometimes necessary Laplace correction needs the domain sizes of the features. For some features, such as actors, this domain is very large and in general not known. We instead use as the domain size for feature i the number of feature-value pairs (i, v) that exist in the training set, with a lower bound of 1 to prevent domains of size 0.

In the rating histories used, some features are often not present, leading to relatively small values of Nmin(i), defined in Equation 16, even for large training sets. This consequently leads to a slow take-over by the learning recommender.
Besides this, the features that are often not present generally do not contribute significantly to the classification accuracy. Hence, all features that were largely absent in the rating histories were discarded. For simplicity, we also discarded the feature 'description'.

6.3 Proof of concept

The performance measure we use is the classification error rate. Figures 4–10 compare the classification error rates of the integrated recommender with those of a learning recommender for each of the users. For each of the users, the results show that the cold-start problem has been significantly reduced. Depending on the user, the integrated recommender has an advantage over the learning recommender for a longer or shorter period of time. For each user, the curves converge quite fast to a difference of less than 0.05. This may be explained by the fact that the tastes of the users are obviously not that difficult to learn. Apparently, choosing priors based on a small training set, and using an ordinary average instead of a weighted average as explained in Section 5, do not have a devastating effect.

Figure 4: Results for user A (classification error rate versus training set size, for the learning and the integrated recommender).

7. CONCLUDING REMARKS

In this paper, we have proposed to seamlessly integrate a user-defined and a learned profile into a recommender based on naive Bayesian classification, in order to mitigate the cold-start problem and to enhance user control.

The results reported upon provide a proof of concept using a limited amount of ground-truth data. The experiments could be
Figure 5: Results for user B. [Plot of classification error rate versus training set size, comparing the learning and integrated recommenders.]

Figure 6: Results for user C. [Plot of classification error rate versus training set size, comparing the learning and integrated recommenders.]

extended with more data, using, for instance, the Duine data set; see [3]. In addition to these experiments, user tests should be performed to assess the use and usefulness of the proposed solution, including sudden changes of interest of the user. To test sudden changes of interest, a possible approach is to combine the rating histories of two users by appending one to the other. Upon the change from the first to the second rating history, a user-defined profile is built and incorporated, using some data from the second rating history, in a similar way as described in this paper.

The integration of a user-defined profile with a learned profile also allows the user-defined profile to come from another source, or even from various sources. For example, the learned profile of another user can be used to jump-start the learning recommender, while still allowing ample freedom to change this profile.

The proposed solution may be refined by incorporating the results of Pronk, Gutta, and Verhaegh (2005) [8], who extend the naive Bayesian classifier by adding confidence levels to the posterior probability estimates. These confidence levels can provide additional control over how fast the learned profile takes over from the user-defined profile. This is a topic for further research.

A further refinement is to utilize feature-selection algorithms to optimize the performance of the recommender by excluding certain features. An additional issue here is that the user might still want to exert control over the recommender using one of the excluded features. This, too, is a topic for further research.
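The proof-of-concept comparison of Section 6.3 can be illustrated with a small simulation. The sketch below is hypothetical and not the experimental code used in the paper: the synthetic data, the Laplace smoothing, the two-class "like"/"dislike" labels, and the seed counts standing in for a user-defined profile are all assumptions made for illustration.

```python
import math
import random
from collections import defaultdict

class NaiveBayesProfile:
    """Naive Bayes recommender whose profile is a table of per-class
    (feature, value) counts, optionally seeded with user-defined counts."""

    def __init__(self, seed_counts=None):
        self.counts = defaultdict(lambda: defaultdict(int))
        self.totals = defaultdict(int)
        for cls, fvs in (seed_counts or {}).items():
            for fv, c in fvs.items():
                self.counts[cls][fv] += c
                self.totals[cls] += c

    def rate(self, item, cls):
        # One rating increments the counts of all the item's
        # feature-value pairs under the given class.
        for fv in item:
            self.counts[cls][fv] += 1
        self.totals[cls] += 1

    def classify(self, item):
        def log_score(cls):
            n = self.totals[cls]
            s = math.log(n + 1)  # smoothed, unnormalised class prior
            for fv in item:
                s += math.log((self.counts[cls][fv] + 1) / (n + 2))
            return s
        return max(("like", "dislike"), key=log_score)

def error_rate(clf, test_set):
    wrong = sum(clf.classify(item) != cls for item, cls in test_set)
    return wrong / len(test_set)

# Synthetic ground truth: the user likes everything featuring actor "A".
random.seed(1)
def make_item():
    return [("actor", random.choice("ABCD")),
            ("channel", random.choice("XY"))]
def true_class(item):
    return "like" if ("actor", "A") in item else "dislike"

data = [(it, true_class(it)) for it in (make_item() for _ in range(200))]
train, test = data[:100], data[100:]

# A hypothetical user-defined profile expressing the same preference.
seed = {"like": {("actor", "A"): 5},
        "dislike": {("actor", "B"): 2, ("actor", "C"): 2, ("actor", "D"): 2}}

for name, clf in [("learning", NaiveBayesProfile()),
                  ("integrated", NaiveBayesProfile(seed))]:
    for k, (item, cls) in enumerate(train):
        if k % 25 == 0:
            print(f"{name:10s} n={k:3d} error={error_rate(clf, test):.2f}")
        clf.rate(item, cls)
```

In this toy setting the seeded ("integrated") recommender starts with useful counts and so avoids the cold start that the empty ("learning") recommender suffers, mirroring the qualitative shape of the learning curves in Figures 4–10.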
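The idea of controlling how fast the learned profile takes over from the user-defined one can be sketched as a simple count-blending rule. The functional form and the `half_life` parameter below are hypothetical stand-ins, not the weighted average of Section 5 nor the confidence mechanism of [8].

```python
def blended_count(learned, user_defined, n_ratings, half_life=20):
    """Blend a learned and a user-defined count for one feature-value pair.

    The learned profile's weight grows with the number of ratings; at
    n_ratings == half_life the two profiles contribute equally. Both the
    functional form and the half_life default are assumptions.
    """
    w = n_ratings / (n_ratings + half_life)
    return w * learned + (1 - w) * user_defined

# With no ratings yet, the user-defined profile dominates entirely;
# as ratings accumulate, the blend drifts toward the learned counts.
print(blended_count(10, 2, 0))     # user-defined only
print(blended_count(10, 2, 20))    # equal weight
print(blended_count(10, 2, 2000))  # almost fully learned
```

Confidence levels on the posterior estimates, as in [8], could replace the fixed `half_life` with a data-driven takeover speed.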
Figure 7: Results for user D. [Plot of classification error rate versus training set size, comparing the learning and integrated recommenders.]

Figure 8: Results for user E. [Plot of classification error rate versus training set size, comparing the learning and integrated recommenders.]

8. REFERENCES

[1] Cestnik, B. [1990]. Estimating probabilities: a crucial task in machine learning, Proceedings of the European Conference on Artificial Intelligence, Stockholm, Sweden, 147–149.
[2] Cost, S., & Salzberg, S. [1993]. A weighted nearest neighbor algorithm for learning with symbolic features, Machine Learning 10, 57–78.
[3] Duine [2004]. Duine Project, http://www.telin.nl/project/Home.cfm?id=387&language=en.
[4] Elkan, C. [2001]. The foundations of cost-sensitive learning, Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence, Seattle, Washington, 973–978.
[5] Gärtner, T., Wu, S., & Flach, P.A. [2001]. Data mining on the Sisyphus dataset: evaluation and integration of results, in: C. Giraud-Carrier, N. Lavrac, & S. Moyle (Eds.), Proceedings of the Workshop Integrating Aspects of Data Mining, Decision Support and Meta-Learning, Freiburg, Germany, 69–80.
[6] Kurapati, K., & Gutta, S. [2002]. Instant personalization via clustering TV viewing patterns, Proceedings of the Sixth International Conference on Artificial Intelligence and Soft Computing, Banff, Canada.
[7] Mitchell, T.M. [1997]. Machine Learning. McGraw-Hill.
Figure 9: Results for user F. [Plot of classification error rate versus training set size, comparing the learning and integrated recommenders.]

Figure 10: Results for user G. [Plot of classification error rate versus training set size, comparing the learning and integrated recommenders.]

[8] Pronk, V., Gutta, S.V.R., & Verhaegh, W.F.J. [2005]. Incorporating confidence in a naive Bayesian classifier, in: L. Ardissono, P. Brna, & A. Mitrovic (Eds.), Lecture Notes in Artificial Intelligence 3538: Proceedings of the Tenth International Conference on User Modeling, Edinburgh, UK, 317–326.
[9] Schafer, J.B. [2005]. DynamicLens: A dynamic user interface for a meta-recommendation system, Proceedings of the Tenth International Conference on Intelligent User Interfaces, San Diego, California.
[10] Stanfill, C., & Waltz, D. [1986]. Toward memory-based reasoning, Communications of the ACM 29:12, 1213–1228.
[11] Störr, H.-P. [2002]. A compact fuzzy extension of the naive Bayesian classification algorithm, Proceedings of the Third International Conference on Intelligent Technologies and Vietnam-Japan Symposium on Fuzzy Systems and Applications, Hanoi, Vietnam, 172–177.
[12] Zimmerman, J., Kurapati, K., Buczak, A.L., Schaffer, D., Martino, J., & Gutta, S. [2004]. TV personalization system: design of a TV show recommender engine and interface, in: L. Ardissono, A. Kobsa, & M.T. Maybury (Eds.), Targeting Programs to Individual Viewers, Human-Computer Interaction Series 6. Springer.