Hierarchical Probabilistic Relational Models for Collaborative Filtering

Jack Newton and Russell Greiner
Department of Computing Science
University of Alberta
Edmonton, AB, Canada T6G 2E8
{newton, greiner}@cs.ualberta.ca

Abstract

This paper applies Probabilistic Relational Models (PRMs) to the Collaborative Filtering task, focussing on the EachMovie data set. We first learn a standard PRM, and show that its performance is competitive with the best known techniques. We then define a hierarchical PRM, which extends standard PRMs by dynamically refining classes into hierarchies, which improves the expressiveness as well as the context sensitivity of the PRM. Finally, we show that hierarchical PRMs achieve state-of-the-art results on this dataset.

1 Introduction

Personalized recommender systems, which recommend specific products (e.g., books, movies) to individuals, have become very prevalent. The challenge faced by these systems is predicting what each individual will want. A pure content-based recommender will base this on only facts about the products themselves and about the individual (potential) purchaser. This enables us to express each possible purchase as a simple vector of attributes, some about the product and others about the person. If we also know who previously liked what, we can view this as a standard labelled data sample, and use standard machine learning techniques [Mit97] to learn a classifier, which we can later use to determine whether a (novel) person will like some (novel) item.

To make this concrete, consider a movie recommendation system which tries to determine whether a specified person will like a specified movie — e.g., will George like SoundOfMusic (SoM)? A content-based system could use a large People×Movies database, where each tuple lists facts about a person, then facts about a movie, together with a vote (e.g., a number between 1 and 5). We could build a classifier that predicts this vote, based on facts about a person and movie — here about George and about SoM. Notice this prediction does not consider other people (e.g., people "similar" to George) or other movies (like SoM).

The other main type of recommender system, "collaborative filtering", addresses this deficiency by using associations: if person P1 appears similar to person P2 (perhaps based on their previously "liked" movies), and P2 liked X, then perhaps P1 will like X as well. A pure collaboration-only system would use only a matrix whose ⟨i, j⟩ element is the vote that person i gives to movie j, which could be blank. The challenge, then, is using this matrix effectively, to acquire the patterns that will help us predict future votes. While there are a number of other techniques that have proven effective here, such as clustering, PCA, and K-nearest-neighbor [UF98a, UF98b], classical machine learning techniques do not work here, as there is no simple way to map this matrix into a simple fixed-size vector of attributes.

Of course, we would like to use both content and collaborative information. Here, we can include, as training data, facts about the people, facts about the movies, and a set of ⟨P, M, V⟩ records, each specifying that person P gave movie M the vote V. The challenge is how to use all of this information to predict how George will vote on SoM.
Here, we want to use facts about George and about SoM, and also find and exploit collaborative properties that deal with people similar to George (in terms of liking similar movies) and movies similar to SoM (in terms of being liked by similar people).

Stepping back, the challenge here is learning a distribution over a set of databases, here descriptions of sets of people and sets of products, as well as their votes.
This is quite different from the classical machine learning challenge of learning distributions over tuples (i.e., individual rows of a single relational database).

Addressing this type of relational learning and inference problem is exactly what Probabilistic Relational Models (PRMs) [KP98] were designed to do. In this paper we show how PRMs can be successfully applied to this learning scenario, for addressing the Collaborative Filtering (CF) problem. We examine the effectiveness of standard PRMs applied to the CF task on the EachMovie [Eac] dataset, then we evaluate the effectiveness of an extended version of PRMs called hierarchical PRMs (hPRMs) [Get02]. We show that standard PRMs can be used to achieve competitive results on the CF task. Furthermore, we demonstrate that hPRMs can outperform standard PRMs on the CF task.

In the remainder of this section we provide an overview of the CF problem. In Section 2 we describe standard PRMs and our application of the PRM framework to the CF domain. Section 3 introduces our implementation of hierarchical PRMs, and shows how an hPRM can provide a more expressive model of the EachMovie dataset. Finally, Section 4 demonstrates the overall effectiveness of PRMs in the CF problem domain, and in particular the superiority of hPRMs over standard PRMs.

1.1 Collaborative Filtering

Collaborative Filtering is an area that has attracted increasing attention in recent years. In the most general sense, the CF problem involves predicting a user's preference for a particular item, such as a movie or a book. This prediction is based both on the user's known preferences on other items, and on the known preferences that other, similar, users have demonstrated. Widely used collaborative filtering systems include Amazon.com's book recommender and Yahoo!'s LAUNCHcast music recommender system.

Traditionally, CF algorithms are divided into two major groups: memory-based algorithms and model-based algorithms. Memory-based algorithms operate over the entire dataset; when a new prediction is required, a memory-based algorithm generally iterates over the entire user database to arrive at a prediction. Breese et al. provide an excellent overview of several memory-based algorithms in [BHK98]. Model-based algorithms, on the other hand, use the user database to create a model of which factors help predict a user's preferences, and use this learned model to make predictions for a new user. Clustering [BHK98] and Bayesian models [CHM97] are two examples of model-based methods that have been used in the past. (See also [Aha97].)

2 Standard Probabilistic Relational Models

A PRM can be viewed as an extension of Bayesian Networks to a relational setting. PRMs learn class-level dependencies that can subsequently be used to make inferences about a particular instance of a class. For example, we might connect (the class of) teenage boys to (the class of) action movies, then use that to infer that the teenage boy Fred likes the action movie Terminator. Of course, we could do something like that in a standard Bayesian Network, by first transforming this relational information into a non-structured, propositionalized form. While this is useful for inference (and in fact is what we will do), it is not an effective way to learn the interrelationships. A PRM can be learned directly on a relational database, thereby retaining and leveraging the rich structure contained therein.

We base our notation and conventions for PRMs on those used in [Get02].
A PRM operates on a Relational Schema, which consists of two fundamental elements: a set of classes, X = {X1, . . . , Xn}, and a set of reference slots that define the relationships between the classes. Each class X is composed of a set of descriptive attributes, A(X), which in turn take on a range of values V(X.A). For example, consider a schema describing a domain of votes on movies. This schema has three classes called Vote, Person, and Movie. For the Vote class, the descriptive attribute is Score, with values {0, . . . , 5}; for Person the descriptive attributes are Age and Gender, which take on values {young, middle-aged, old} and {Male, Female} respectively; and for Movie the single descriptive attribute is Rating, which takes on values {G, PG, M, R}.

Furthermore, a class can be associated with a set of reference slots, R(X). A particular reference slot, X.ρ, describes how objects of class X are related to objects of other classes in the relational schema. One or more reference slots can be composed to form a reference slot chain, τ, and attributes of related objects can be denoted using the shorthand X.τ.A, where A is a descriptive attribute of the related class. Continuing our example, the Vote class would be associated with two reference slots: Vote.ofPerson, which describes how to link a Vote object to a specific Person object; and Vote.ofMovie, which describes how to link a Vote object to a specific Movie object.
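As a concrete rendering of this schema, the sketch below encodes the three classes, their descriptive attributes, and the two reference slots as plain Python objects; this is a minimal illustration of the formalism under assumed attribute names, not code from the paper. A slot chain such as Vote.ofPerson.Gender becomes ordinary attribute traversal.

```python
# A minimal sketch of the Vote/Person/Movie relational schema. Class and
# attribute names follow the running example; everything else is illustrative.
from dataclasses import dataclass

@dataclass
class Person:
    age: str      # one of {"young", "middle-aged", "old"}
    gender: str   # one of {"Male", "Female"}

@dataclass
class Movie:
    rating: str   # one of {"G", "PG", "M", "R"}

@dataclass
class Vote:
    score: int         # descriptive attribute, in {0, ..., 5}
    of_person: Person  # reference slot Vote.ofPerson
    of_movie: Movie    # reference slot Vote.ofMovie

george = Person(age="middle-aged", gender="Male")
som = Movie(rating="G")
v = Vote(score=4, of_person=george, of_movie=som)

# The slot chain Vote.ofPerson.Gender is just attribute traversal:
print(v.of_person.gender)   # -> "Male"
```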
A PRM Π defines a probability distribution over all possible instantiations I of the relational schema. It is made up of two components: the dependency structure S and its associated parameters, θ_S. The dependency structure defines the parents Pa(X.A) for each attribute X.A. The parent of an attribute X.A can be a descriptive attribute within the class X, or it can be a descriptive attribute in another class Y that is reachable through a reference slot chain. For instance, in the above example, Vote.Score could have the parent Vote.ofPerson.Gender. Note that in most cases the parent of a given attribute will take on a multiset of values S in V(X.τ.A). For example, we could discover a dependency of a Person's Age on their ratings of movies in the Children genre. However, we cannot directly model this dependency, since the user's ratings on Children's movies form a multiset of values, say {4, 5, 3, 5, 4}. For such a numeric attribute, we may choose to use the Median database aggregate operator to reduce this multiset to a single value, in this case 4. In this paper we reduce S to a single value using various types of aggregation functions.

The following definition summarizes the key elements of a PRM:

Definition 1 ([Get02]) A probabilistic relational model (PRM) Π for a relational schema S is defined as follows. For each class X ∈ X and each descriptive attribute A ∈ A(X), we have a set of parents Pa(X.A), and a conditional probability distribution (CPD) that represents P_Π(X.A | Pa(X.A)).

2.1 Applying Standard PRMs to the EachMovie Dataset

PRMs provide an ideal framework for capturing the kinds of dependencies a recommender system needs to exploit. In general, model-based collaborative filtering algorithms try to capture high-level patterns in data that provide some amount of predictive accuracy. For example, in the EachMovie dataset, one may want to capture the pattern that young males tend to rate Action movies quite highly, and subsequently use this dependency to make inferences about unknown votes. PRMs are able to model such patterns as class-level dependencies, which can subsequently be used at an instance level to make predictions on unknown ratings (e.g., how George will vote on SoM).

In order to use a PRM to make predictions about an unknown rating, we must first learn the PRM from data. In our experiments we use the PRM learning procedure described in [FGKP99], which provides an algorithm both for learning a legal structure for a PRM and for estimating the parameters associated with that PRM. Figure 1(a) shows a sample PRM structure learned from the EachMovie dataset.

With the learned PRM in hand, we are left with the task of making an inference about a new, previously unseen Vote.Score. To accomplish this task, we leverage the ground Bayesian Network [Get02] induced by a PRM. Briefly, a Bayesian Network is constructed from a database using the link structure of the associated PRM's dependency graph, together with the parameters that are associated with that dependency graph. For example, for the PRM in Figure 1(a), if we needed to infer the Score value for a new Vote object, we simply construct a ground Bayesian Network using the appropriate attributes retrieved from the associated Person and Movie objects; see Figure 1(b). The PRM's class-level parameters for the various attributes are then tied to the ground Bayesian Network's parameters, and standard Bayesian Network inference procedures can be used on the resulting network [Get02].
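As a toy illustration of this grounding step, the sketch below ties one class-level CPD, P(Score | ofPerson.Gender, ofMovie.Rating), to a single new Vote object; all probabilities are invented for illustration, not parameters learned from EachMovie.

```python
# A minimal sketch of inference in the ground Bayesian Network for one new
# Vote: the class-level CPD is shared (tied) across all Vote instances, so
# prediction reduces to following the reference slots and looking up the CPD.
SCORES = [0, 1, 2, 3, 4, 5]

# Hypothetical class-level CPD P(Vote.Score | ofPerson.Gender, ofMovie.Rating).
cpd_score = {
    ("Male", "G"):   [0.05, 0.05, 0.10, 0.20, 0.25, 0.35],
    ("Female", "G"): [0.05, 0.10, 0.15, 0.25, 0.25, 0.20],
    # ... one entry per (Gender, Rating) combination in the full model
}

def predict_vote(person_gender, movie_rating):
    """Ground the CPD for one Vote object and return the most probable score."""
    dist = cpd_score[(person_gender, movie_rating)]
    return max(zip(SCORES, dist), key=lambda pair: pair[1])[0]

# How will George (Male) vote on SoundOfMusic (rated G)?
print(predict_vote("Male", "G"))   # -> 5 under the made-up CPD
```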
3 Hierarchical Probabilistic Relational Models

In this section we describe our approach to extending standard PRMs to include class hierarchies. The concept of learning PRMs with class hierarchies is introduced in [Get02].

3.1 Motivation

The collaborative filtering problem presents two major motivations for hPRMs. First, in the above model, Vote.Score can depend on attributes of related objects, such as Person.Age, but it is not possible to have Vote.Score depend on itself in any way. This is due to the fact that the class-level PRM's dependency structure must be a directed acyclic graph (DAG) in order to guarantee that the instance-level ground Bayesian Network forms a DAG [FGKP99], and thus a well-formed probability distribution. Without the ability to have Vote.Score depend probabilistically on itself, we lose the ability to have a user's rating of an item depend on his ratings of other items, or on other users' ratings of this movie. For example, we may wish to have a user's ratings of Comedies influence his rating of Action movies, or his rating of a specific Comedy movie influence his ratings of other Comedy movies. Second, in the above model we are restricted to one dependency graph for Vote.Score; however, depending on the type of object the rating is for, we may wish to have a specialized dependency graph to better model the dependencies. For example, the dependency graph for an Action movie may have Vote.Score depend on Vote.ofPerson.Gender, whereas a Documentary may depend on Vote.ofPerson.Age.
Figure 1: (a) Standard PRM learned on the EachMovie dataset. (b) Ground Bayesian Network for one Vote object.

Figure 2: Sample class hierarchy

3.2 Overview

To address the problem described above, we must introduce a class hierarchy that applies to our dataset, and modify the PRM learning procedure to leverage this class hierarchy in making predictions. In general, the class hierarchy can either be provided as input, or can be learned directly from the data. We refer to the class hierarchy for class X as H[X]. Figure 2 shows a sample class hierarchy for the EachMovie domain.

H[X] is a DAG that defines an IS-A hierarchy using the subclass relation ≺ over a finite set of subclasses C[X] [Get02]. For given c, d ∈ C[X], c ≺ d indicates that Xc is a direct subclass of Xd (and Xd is a direct superclass of Xc). The leaf nodes of H[X] represent the basic subclasses of the hierarchy, denoted basic(H[X]). In this paper we assume all objects are members of a basic subclass, although this is not a fundamental limitation of hPRMs. Each object of class X has a subclass indicator X.Class ∈ basic(H[X]), which can either be defined manually or learned automatically by a supplementary algorithm. By defining a hierarchy for a class X in a PRM, we also implicitly specialize the classes that are reachable from X via one or more reference slots. For example, if we specialize the Movie class, we implicitly specialize the related Vote table into a hierarchy as well: in Figure 3, the Vote class is refined into four different pseudo-classes, each associated with one of the hierarchy elements in basic(H[X]).

Definition 2 A Hierarchical Probabilistic Relational Model (hPRM) Π_H is defined as:

• A class hierarchy H[X] = (C[X], ≺)
• A set of basic, leaf-node elements basic(H[X]) ∈ H[X]
• A subclass indicator attribute X.Class ∈ basic(H[X])
• For each subclass c ∈ C[X] and attribute A ∈ A(X), a specialized CPD for c, denoted P(X^c.A | Pa^c(X.A))
• For every class Y reachable via a reference slot chain from X, a specialized CPD for c, denoted P(Y^c.A | Pa^c(Y.A))

The algorithm for learning an hPRM is very similar to the algorithm for learning a standard PRM. Instead of dealing with the standard set of classes X when evaluating structure quality and estimating parameters, our hPRM algorithm dynamically partitions the dataset into the subclasses defined by H[X]. A similar technique is used for inference: for any given instance i of a class, i's place in the hierarchy is flagged through X.Class, and this flag makes it possible to associate the proper CPD with a given class instance.
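To make the partitioning step concrete, here is a minimal sketch of hPRM parameter estimation under an assumed two-element hierarchy: Vote objects are grouped by the subclass indicator of their linked Movie, and a specialized CPD is estimated within each basic subclass. The subclass names and data are hypothetical.

```python
# A minimal sketch of hPRM learning: partition the Vote objects by the
# subclass indicator Movie.Class, then estimate a specialized CPD
# P(Score^c | Gender) for each basic subclass c. Data are hypothetical.
from collections import Counter, defaultdict

# (movie_subclass, person_gender, score) triples standing in for Vote objects.
votes = [
    ("action", "Male", 5), ("action", "Male", 4), ("action", "Female", 2),
    ("comedy", "Male", 3), ("comedy", "Female", 5), ("comedy", "Female", 4),
]

# counts[c][gender] is a Counter over scores observed within subclass c.
counts = defaultdict(lambda: defaultdict(Counter))
for subclass, gender, score in votes:
    counts[subclass][gender][score] += 1

def specialized_cpd(subclass, gender):
    """Maximum-likelihood estimate of P(Score | Gender) within one subclass."""
    c = counts[subclass][gender]
    total = sum(c.values())
    return {score: n / total for score, n in c.items()}

# The same parent configuration yields different CPDs in different subclasses:
print(specialized_cpd("action", "Male"))   # -> {5: 0.5, 4: 0.5}
print(specialized_cpd("comedy", "Male"))   # -> {3: 1.0}
```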
Figure 3: Example hPRM for the EachMovie dataset

3.3 Applying hPRMs to the EachMovie Dataset

Applying the hPRM framework to the EachMovie dataset requires a hierarchy to be defined, which is then used to build an hPRM that is ultimately used to make predictions for unknown votes.

In our experiments we automatically learn a hierarchy to be used in the learning procedure. In the EachMovie database, a movie can belong to zero or more of the following genre categories: action, animation, art-foreign, classic, comedy, drama, family, horror, romance, thriller. We denote the set of genres that a movie belongs to by G; for example, G(WhenHarryMetSally) = {comedy, drama, romance}. To build our hierarchy dynamically, we first enumerate all combinations of genres that appear in the EachMovie database, and denote this set of genre combinations 𝒢. We then proceed to greedily partition 𝒢 by the number of movies each element of 𝒢 is associated with, until we reach a predefined limit of k partitions. We define one additional noisy-or-type partition that is used for movies that do not fall into one of the predefined partitions. This partition, together with the other k partitions, is used to create a (k + 1)-element hierarchy.

With the hierarchy defined, the hPRM is applied to the EachMovie dataset just as the standard PRM model was in Section 2.1, with the exception that the learning procedure is modified as outlined above.

4 Experimental Results

In this section we outline our results in applying both standard PRMs and hPRMs to the collaborative filtering task for the EachMovie dataset. We also compare our results to other CF algorithms.

4.1 Experimental Design

One of the main challenges in designing an experiment to test the predictive accuracy of a PRM model is avoiding resubstitution error. If a PRM is learned on the entire EachMovie database, and subsequently used to make predictions on objects from the same database, we are using the same data for testing as we used for training.

We address this issue by applying a modified cross-validation procedure to the dataset. While the traditional method of dividing data into cross-validation folds cannot be applied directly to a relational database, we extend the basic idea to a relational setting as follows. For n-fold cross-validation, we first create n new datasets D1, . . . , Dn with the EachMovie data schema. We then iterate over all the objects in the Person table, and randomly allocate each individual to a fold Di ∈ {D1, . . . , Dn}. Finally, we add all the Vote objects linked to that individual, and all the Movie objects linked to those Vote objects, to Di. When this procedure is complete, n datasets with roughly balanced properties (in terms of number of individuals, number of votes per person, etc.) will have been created. In our experiments we use 5-fold cross-validation.
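A minimal sketch of this relational split follows, assuming simple in-memory stand-ins for the Person, Vote, and Movie tables; the real experiments run against the EachMovie database, so the table layouts here are illustrative only.

```python
# A minimal sketch of the relational n-fold split: each Person is assigned to
# one fold, and that person's Vote objects (plus the Movies they reference)
# follow them into the same fold. Table layouts are hypothetical.
import random

def relational_folds(people, votes, n=5, seed=0):
    """people: list of person ids; votes: list of (person_id, movie_id, score).
    Returns n datasets, each with its own person/vote/movie subsets."""
    rng = random.Random(seed)
    folds = [{"people": set(), "votes": [], "movies": set()} for _ in range(n)]
    assignment = {p: rng.randrange(n) for p in people}
    for p in people:
        folds[assignment[p]]["people"].add(p)
    for person_id, movie_id, score in votes:
        fold = folds[assignment[person_id]]
        fold["votes"].append((person_id, movie_id, score))
        fold["movies"].add(movie_id)  # referenced Movie objects travel with the fold
    return folds

people = ["george", "alice", "bob", "carol", "dave"]
votes = [("george", "SoM", 4), ("alice", "SoM", 5), ("bob", "T2", 3)]
for i, fold in enumerate(relational_folds(people, votes)):
    print(i, sorted(fold["people"]), len(fold["votes"]))
```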
4.2 Evaluation Criteria

In this paper we adopt the Absolute Deviation metric [MRK97, BHK98] to assess the quality of our CF algorithms. We divide the data into a training and test set using the method described above, and build a PRM using the training data. We then iterate over each user in the test set, allowing each user to become the active user. For the active user we then iterate over his set of votes, P_a, allowing each vote to become the active vote; the remaining votes are used in the PRM model (if needed). The predicted vote for the active user a on movie j is denoted p_{a,j}, whereas the actual vote is denoted v_{a,j}. The average absolute deviation, where m_a is the number of vote predictions made, is:

    S_a = (1/m_a) Σ_{j ∈ P_a} |p_{a,j} − v_{a,j}|     (1)

The absolute deviation for the dataset as a whole is arrived at by averaging this score over all the users in the test set.

4.3 Standard PRMs

    Algorithm    Absolute Deviation
    CR           1.257
    BC           1.127
    BN           1.143
    VSIM         2.113
    PRM          1.26

Table 1: Absolute Deviation scoring results for the EachMovie dataset. Lower scores are better.

In our experiments we were able to achieve an absolute deviation error of 1.26. For comparison, we have included the results from [BHK98]; in that paper four CF algorithms were tested: Correlation (CR), Bayesian Clustering (BC), a Bayesian Network model (BN), and Vector Similarity (VSIM). We have elected to include the results from [BHK98] where algorithms were given two votes out of the non-active votes to use in making the prediction, since the standard PRM model does not have any direct dependency on other Votes.

In this experiment standard PRMs are able to outperform the VSIM algorithm, and are competitive with the correlation-based algorithm. However, both Bayesian Clustering and the Bayesian Network model have superior results in this context.

4.4 Hierarchical PRMs

The first part of the experiment for hPRMs was constructing a class hierarchy from the EachMovie dataset. In our experiment we set the size of the hierarchy to 12. Our greedy partitioning algorithm arrived at the following basic classes: drama, comedy, classic, action, art-foreign/drama, thriller, romance/comedy, none, family, horror, action/thriller, other.

    Algorithm    Absolute Deviation
    CR           0.994
    BC           1.103
    BN           1.066
    VSIM         2.136
    hPRM         1.060

Table 2: Absolute Deviation scoring results for the EachMovie dataset. Lower scores are better.

By applying hPRMs to the EachMovie dataset, we are able to reduce the absolute deviation error from 1.26 (with standard PRMs) to 1.06. Again, for comparison we include results from [BHK98]; however, since hPRMs are able to leverage the other votes the user has made in making predictions, we use the All-But-One results presented in [BHK98], where the prediction algorithm is able to use all of the active user's votes (except for the current active vote) in making a prediction. As one can see by comparing Table 1 to Table 2, including the additional voting information results in a substantial reduction in error rate for most of the other four algorithms.

hPRMs not only provide a significant performance advantage over standard PRMs, but are also able to outperform all but one of the other four algorithms.

5 Future Work

In this paper we learned a fairly broad hierarchy based on various sub-genres in the EachMovie dataset. However, hPRMs allow for arbitrarily specific class hierarchies, where a leaf-node entry for H[X] might be a handful of movies, or even just one.
This could be exploited when there are certain indicator movies in a genre that may accurately predict a user's votes on other movies in that (or other) genres, and that furthermore most users have seen. For example, whether or not a user likes science fiction movies, they have likely seen the Star Wars trilogy; this fact could be leveraged by making the Star Wars movies a basic subclass in the class hierarchy, which could subsequently be used to learn new dependencies of the type "if a user likes the Star Wars movies, they will in general like science fiction movies". Learning such a hierarchy is a challenging task, but solving it would likely improve the performance of hPRMs significantly.
6 Conclusion

In this paper we outlined a framework for modelling the collaborative filtering problem with PRMs. We model the CF problem first using a standard PRM, then we extend the model to account for hierarchical relationships that are present in the data. hPRMs improve the expressiveness and context-sensitivity of standard PRMs, and also realize real-world performance benefits.

Acknowledgements

Both authors thank Lise Getoor for useful discussions and encouragement (as well as access to her software), Dale Schuurmans for his advice on this project, and NSERC and the Alberta Ingenuity Centre for Machine Learning for funding. Some of this work was done when RG was on sabbatical visiting CALD/CMU.

References

[Aha97] D. Aha. Special issue on "Lazy Learning". Artificial Intelligence Review, 11(1–5), February 1997.

[BHK98] John S. Breese, David Heckerman, and Carl Kadie. Empirical analysis of predictive algorithms for collaborative filtering. In UAI98, pages 43–52, 1998.

[CHM97] David Maxwell Chickering, David Heckerman, and Christopher Meek. A Bayesian approach to learning Bayesian networks with local structure. In UAI97, pages 80–89, 1997.

[Eac] The EachMovie dataset. http://research.compaq.com/SRC/eachmovie/.

[FGKP99] Nir Friedman, Lise Getoor, Daphne Koller, and Avi Pfeffer. Learning probabilistic relational models. In IJCAI, pages 1300–1309, 1999.

[Get02] L. Getoor. Learning Statistical Models from Relational Data. PhD thesis, Stanford University, 2002.

[KP98] D. Koller and A. Pfeffer. Probabilistic frame-based systems. In Proc. of the Fifteenth National Conference on Artificial Intelligence, pages 580–587, Madison, WI, 1998.

[Mit97] Tom M. Mitchell. Machine Learning. McGraw-Hill, 1997.

[MRK97] Bradley N. Miller, John T. Riedl, and Joseph A. Konstan. Experience with GroupLens: Making Usenet useful again. In 1997 USENIX Annual Technical Conference, pages 219–233, Anaheim, CA, January 1997. USENIX.

[UF98a] L. Ungar and D. Foster. Clustering methods for collaborative filtering, 1998.

[UF98b] L. Ungar and D. Foster. A formal statistical approach to collaborative filtering, 1998.