Using Genetic Algorithm for Hybrid Modes of Collaborative Filtering in Online recommenders Simon Fong, Y vonne Ho, Yang Hang Faculty of Science and Technology, University of Macau, Macau SAR ccfong @umac mo Abstract Content-based Filtering selects items based on the orrelation between the content of the items and user Online recommenders are usually referred to those preferences. Current search engines are based on used in e-Commerce websites for suggesting a product automatic analysis of the content of documents and the or service out of many choices. The core technology content of user's query. As an example shown in Figure implemented behind this type of recommenders la, that is a movie recommendation application, in includes content analysis, collaborative filtering and order to recommend movies to a user, the content some hybrid variants. Since they all have certain based recommender system tries to understand the strengths and limitations, combining them may be a commonalities among the movies that the user has rated highly in the past(specific actors, directors overcoming a large amount of input variables genres, subject matter, etc. ) Movie item D is hence especially from combining different techniques. recommended because it is has a certain high degree of Genetic algorithm(GA)is an ideal optimization search similarity to Item a by its content. function, for finding a best recommendation out of a For another example, a Personalized Recommender large population of variables. In this paper we System [l] creates dynamic hyperlinks on a web site presented a GA-based approach for supporting that contains a collection of advises about do it yourself ombined modes of collaborative filtering. In particular, we show that how the input variables can based on the similarity that they have, to the ones the be coded into ga chromosomes in various modes. user has highly rated in the past. In addition to Insights of how Ga can be used in recommenders are hypertext content, the same concept on content analysis derived through our experiments with the input data was extended in [2] for recommending multimedia taken from Movielens and IMDB products such as mp3 music on a recommender website 1 Introduction One common problem with the Content-based recommendation system is that it can only recommend Recommender systems have been widely adopted by items scoring highly against the user profile so the user e-Commerce websites to suggest products or services is restricted to see the items similar to those already to customers. In general, they assist users to narrow rated. new items will never be recommended due to the down their choices for making a purchase decision working on an individual user. from a large pool of items. One simple way of offering Collaborative Filtering(CF) based on the similarity recommendation is based on the top selling products between currently active user and his however does not differentiate customers who new items the active has never seen before but they have different tastes. The other approach is known were guessed to be interested by him because the other as one-to-one marketing, which takes into account the users who have similar interest to his have seen/liked nformation about a particular user and tries to find The similarity can either be measured by the same item personal match of product that is predicted to best sui which known as item-based CF or by the ype his or her flavor. Generally, there are three methods of user, known as user-based CF. This method is to the recommendation systems for one-to-one marketing suggest new items or to predict the utility of a certain Content-based Filtering, Collaborative Filtering and tem for a particular user based on his previous likings Hybrid Model. or the opinions of other like-minded users
Using Genetic Algorithm for Hybrid Modes of Collaborative Filtering in Online Recommenders Simon Fong, Yvonne Ho, Yang Hang Faculty of Science and Technology, University of Macau, Macau SAR ccfong@umac.mo Abstract Online recommenders are usually referred to those used in e-Commerce websites for suggesting a product or service out of many choices. The core technology implemented behind this type of recommenders includes content analysis, collaborative filtering and some hybrid variants. Since they all have certain strengths and limitations, combining them may be a promising solution provided there is a way of overcoming a large amount of input variables especially from combining different techniques. Genetic algorithm (GA) is an ideal optimization search function, for finding a best recommendation out of a large population of variables. In this paper we presented a GA-based approach for supporting combined modes of collaborative filtering. In particular, we show that how the input variables can be coded into GA chromosomes in various modes. Insights of how GA can be used in recommenders are derived through our experiments with the input data taken from Movielens and IMDB. 1. Introduction Recommender systems have been widely adopted by e-Commerce websites to suggest products or services to customers. In general, they assist users to narrow down their choices for making a purchase decision from a large pool of items. One simple way of offering recommendation is based on the top selling products. This however does not differentiate customers who may have different tastes. The other approach is known as one-to-one marketing, which takes into account the information about a particular user and tries to find a personal match of product that is predicted to best suit his or her flavor. Generally, there are three methods of the recommendation systems for one-to-one marketing: Content-based Filtering, Collaborative Filtering and Hybrid Model. Content-based Filtering selects items based on the correlation between the content of the items and user preferences. Current search engines are based on automatic analysis of the content of documents and the content of user’s query. As an example shown in Figure 1a, that is a movie recommendation application, in order to recommend movies to a user, the contentbased recommender system tries to understand the commonalities among the movies that the user has rated highly in the past (specific actors, directors, genres, subject matter, etc.). Movie item D is hence recommended because it is has a certain high degree of similarity to Item A by its content. For another example, a Personalized Recommender System [1] creates dynamic hyperlinks on a web site that contains a collection of advises about do it yourself home improvement. The advices are recommended based on the similarity that they have, to the ones the user has highly rated in the past. In addition to hypertext content, the same concept on content analysis was extended in [2] for recommending multimedia products such as mp3 music on a recommender website. One common problem with the Content-based recommendation system is that it can only recommend items scoring highly against the user profile; so the user is restricted to see the items similar to those already rated, new items will never be recommended due to the working on an individual user. Collaborative Filtering (CF) based on the similarity between currently active user and other users, finds new items the active has never seen before but they were guessed to be interested by him because the other users who have similar interest to his have seen/liked. The similarity can either be measured by the same item which known as item-based CF or by the same type of user, known as user-based CF. This method is to suggest new items or to predict the utility of a certain item for a particular user based on his previous likings or the opinions of other like-minded users
(a)Content-based Filtering (b) Item-based Collaborative Filtering (c) User-based Collaborative Filtering ples of different Filtering With item-based collaborative filtering as shown in 1. Implementing collaborative and content-based Figure Ib, if many users(User 2 and User3 in this methods separately and combining their predictions, example)like Item A and Item D, we assume that Item 2. Incorporating some content-based characteristics into A is highly correlated with Item D. If Userl likes Item a collaborative approach, A, he should also like Item D given the strong link of 3. Incorporating some collaborative characteristics into a association between items a and d content-based approach, and Item based Collaborative Filtering is quite common 4. Constructing a general unifying model that for current recommendation systems, which have been ncorporates both content-based and collaborative wildly used by Movie Lens and Yahoo. According to characteristics papers [3][4], item-based first analyze One thing in common across the above four fusion user-item matrix to identify relationships between approaches is that a large amount of input data is different items, and then use these relationships to required. Combining several filtering approache directly compute recommendations for users implies more than one aspect of input data might need With user-based collaborative filtering as shown in to be incorporated into the recommender system, such Figure Ic, the taste of User3 is very similar to that of s the users'information, item contents and users Userl because they have preferred Item A and Itemb ratings. A table summarizing what input data is in common. They are deemed to be like-minded users. below. d for the filtering approa Hence user 3 likes item d so will Userl Furthermore, it was argued in [5] that in real life the way in which two people are said to be similar is no Table 1: Input data required by different CF based solely on whether they have complimentary t data pinions on a specific subject, e. g, movie ratings, bu also on other factors, such as their background and User ratings(by a single user) Item-based CF lifestyles. Therefore, when doing the profile matching, User ratings(by multi issues such as age, gender and preferences of movie User-based CF genres must also be taken into account. User ratings(by multiple users)+ a novel framework for user-based collaborative Hybrid CF emographic inforn filtering is proposed in [6] that enables (e.g. GA-based CF) en contents recommendation by groups of closely related User ratings(by multiple users) individuals. By using the rating information from a dividual user in a group can be predicted s of the group of closely related users, unrated iter During filtering, the input data are processed as features by some heuristic algorithms. In some hybrid recommendation systems, all features are included. But 1. 1. Hybrid Model not all features contribute significantly to the quality of recommendation output; some are noises that bring Several recommendation systems(e.g. [7J[8) use a down the prediction accuracy. Moreover,every user hybrid approach by combining collaborative and places a different importance priority on each feature content-based methods, which helps to avoid certain These priorities are enumerated as referred to feature Different ways to combine collaborative and content based methods into a hybrid recommender system can mature users, then his feature weight for age would be be classified as follows. higher than other features
(a) Content-based Filtering (b) Item-based Collaborative Filtering (c) User-based Collaborative Filtering Figure 1. Examples of different Filtering With item-based collaborative filtering as shown in Figure 1b, if many users (User2 and User3 in this example) like Item A and Item D, we assume that Item A is highly correlated with Item D. If User1 likes Item A, he should also like Item D given the strong link of association between Items A and D. Item based Collaborative Filtering is quite common for current recommendation systems, which have been wildly used by Movie Lens and Yahoo. According to papers [3][4], item-based techniques first analyze a user-item matrix to identify relationships between different items, and then use these relationships to indirectly compute recommendations for users. With user-based collaborative filtering as shown in Figure 1c, the taste of User3 is very similar to that of User1 because they have preferred Item A and Item B in common. They are deemed to be like-minded users. Hence User 3 likes Item D so will User1. Furthermore, it was argued in [5] that in real life the way in which two people are said to be similar is not based solely on whether they have complimentary opinions on a specific subject, e.g., movie ratings, but also on other factors, such as their background and lifestyles. Therefore, when doing the profile matching, issues such as age, gender and preferences of movie genres must also be taken into account. A novel framework for user-based collaborative filtering is proposed in [6] that enables recommendation by groups of closely related individuals. By using the rating information from a group of closely related users, unrated items of the individual user in a group can be predicted. 1.1. Hybrid Model Several recommendation systems (e.g. [7][8]) use a hybrid approach by combining collaborative and content-based methods, which helps to avoid certain limitations of content-based and collaborative systems. Different ways to combine collaborative and contentbased methods into a hybrid recommender system can be classified as follows: 1. Implementing collaborative and content-based methods separately and combining their predictions, 2. Incorporating some content-based characteristics into a collaborative approach, 3. Incorporating some collaborative characteristics into a content-based approach, and 4. Constructing a general unifying model that incorporates both content-based and collaborative characteristics. One thing in common across the above four fusion approaches is that a large amount of input data is required. Combining several filtering approaches implies more than one aspect of input data might need to be incorporated into the recommender system, such as the users’ information, item contents and users’ ratings. A table summarizing what input data is required for the filtering approach to work on is shown below. Table 1: Input data required by different CF Filtering approach Input data Content-based analysis Item contents + User ratings (by a single user) Item-based CF Item contents + User ratings (by multiple users) User-based CF User demographic information + User ratings (by multiple users) + Hybrid CF (e.g. GA-based CF) User demographic information + Item contents + User ratings (by multiple users) During filtering, the input data are processed as ‘features’ by some heuristic algorithms. In some hybrid recommendation systems, all features are included. But not all features contribute significantly to the quality of recommendation output; some are noises that bring down the prediction accuracy. Moreover, every user places a different importance priority on each feature. These priorities are enumerated as referred to feature weights. For example, if a mature user prefers to be given recommendations based on the opinions of other mature users, then his feature weight for age would be higher than other features
1. 2. Research motivation fittest individuals. a collection of individuals are represented by chromosomes which are coded in The main objective of this research work is numeric real-value commander that exploits advantages of hybrid CF for high quality prediction and 2.1. Chromosome Encoding recommendation. at the same time ove limitations inherited from hybrid CF technique In order to implement a recommendation Genetic algorithm-based approach is proposed here a case study of Movie recommender, we apply Genetic for it is one of the most powerful search techniques for Algorithm to find the fine-tuned feature weights so that finding an optimized solution of a matching item to be each feature used in C can be characterized; the recommended to the user. In order to support an chromosome structure is presented as in Figure 2 extensive range of input data required for Hybrid CF This Ga chromosome structure is designed in mind (c f. Table 1), the GA approach allows the data to be that it could embrace the full range of input data from features into relativel Table 1 for supporting chromosome structure can be downsized to leave out With the chromosomes ready, GA searches for a some data components such as user demographic data solution that has the highest value of fitness function or item contents, in cases of Item-based CF or User (to be defined later). The feature weights which are based CF were in use respectivel relevant remain, noises are eliminated. During the early Feature I is the movie rating. Features 2-4 represent period of usage, there wasnt enough rating data which the user profile from MovieLens(source is known as the data sparsity problem [9]. We define a movielens umn. edu): age, gender and occupation are ay of solving this problem in our GA recommender the most influential demographic attributes that by coding an additional preference component in the describe the background characteristic of auser chromosome. So during the cold boot-up stage, the Features 5-12 represent the additional profile features preference component will dominate the search calculated in our work; they are the extra component function in the GA operation. dded into the chromosome. So they can be used in the d of 2. GA-based Collaborative Filtering Features 13-30 represent from MovieLens; features 3-37 represent the movie ga is a heuristic search method based on the mechanics of natural selection and genetic evolution, attributes from IMDb(source: imdb. com).Theya.the component of item content that describes about introduced by John Holland in 1975[10]. It maintains a movie. Both data from MovieLens and IMDb are used population of computing feature transformation in order to prevent any bias in any set of the data, for a matrices. By using selection, crossover and mutation fairer evaluation in our experiments ethods of GA, it finds the fitness value for picking the '量 w咧呦wws-Jwda2- Figure 2. Chromosome of full features Figure 6. Chromosome of 7 features Figure 3. GA-based Recommendation System
1.2. Research Motivation The main objective of this research work is to design a recommender that exploits advantages of hybrid CF for high quality prediction and recommendation, at the same time overcoming the limitations inherited from hybrid CF techniques. Genetic algorithm-based approach is proposed here for it is one of the most powerful search techniques for finding an optimized solution of a matching item to be recommended to the user. In order to support an extensive range of input data required for Hybrid CF (c.f. Table 1), the GA approach allows the data to be encoded as ‘features’ into relatively elongated chromosomes. With the chromosomes ready, GA searches for a solution that has the highest value of fitness function (to be defined later). The feature weights which are relevant remain, noises are eliminated. During the early period of usage, there wasn’t enough rating data which is known as the data sparsity problem [9]. We define a way of solving this problem in our GA recommender by coding an additional ‘preference’ component in the chromosome. So during the cold boot-up stage, the preference component will dominate the search function in the GA operation. 2. GA-based Collaborative Filtering GA is a heuristic search method, based on the mechanics of natural selection and genetic evolution, introduced by John Holland in 1975 [10]. It maintains a population of computing feature transformation matrices. By using selection, crossover and mutation methods of GA, it finds the fitness value for picking the fittest individuals. A collection of individuals are represented by chromosomes which are coded in numeric real-value. 2.1. Chromosome Encoding In order to implement a recommendation system as a case study of Movie recommender, we apply Genetic Algorithm to find the fine-tuned feature weights so that each feature used in CF can be characterized; the chromosome structure is presented as in Figure 2. This GA chromosome structure is designed in mind that it could embrace the full range of input data from Table 1 for supporting hybrid CF. Optionally the chromosome structure can be downsized to leave out some data components such as user demographic data or item contents, in cases of Item-based CF or Userbased CF were in use respectively. Feature 1 is the movie rating. Features 2-4 represent the user profile from MovieLens (source: movielens.umn.edu); age, gender and occupation are the most influential demographic attributes that describe the background characteristic of a user. Features 5-12 represent the additional profile features calculated in our work; they are the extra component added into the chromosome. So they can be used in the initial period of usage when the movie ratings are scarce. Features 13-30 represent the movie genres from MovieLens; features 3-37 represent the movie attributes from IMDb (source: imdb.com). They are the component of item content that describes about the movie. Both data from MovieLens and IMDb are used in order to prevent any bias in any set of the data, for a fairer evaluation in our experiments. w1 w2 w3 w4 w5 … w12 Rating Age Gender Occupation type Preferred film 8 user profile features w13 … w30 w31 … w37 Action 18 genres language Preferred Western Director 7 characters Language w1 w2 w3 w4 w5 … w12 Rating Age Gender Occupation type Preferred film 8 user profile features w13 … w30 w31 … w37 Action 18 genres language Preferred Western Director 7 characters Language Figure 2. Chromosome of full features w1 w2 w3 w4 w5 Rating Prefer Director w6 w7 Prefer Actress Prefer Actor Prefer Producer Prefer Writer Prefer Editor w1 w2 w3 w4 w5 Rating Prefer Director w6 w7 Prefer Actress Prefer Actor Prefer Producer Prefer Writer Prefer Editor Figure 6. Chromosome of 7 features Profile Capture: User registered to provide the demographics New user Movie Database Collect Movie Info Recommendations Data Collect selected items from neighbors Search for neighbors from user database User Database Neighbor Set GA Recommender Get Top N neighbors Collect Ratings: Get ratings of movies Create New User Profile Features Phase I Phase II Phase III: Recommender Profile Capture: User registered to provide the demographics New user Movie Database Collect Movie Info Recommendations Data Collect selected items from neighbors Search for neighbors from user database User Database Neighbor Set GA Recommender Get Top N neighbors Collect Ratings: Get ratings of movies Create New User Profile Features Phase I Phase II Phase III: Recommender Figure 3. GA-based Recommendation System
2.2. GA Recommender System Architecture Each User record We constructed a GA-based recommendation System as a prototype for conducting experiments. The Movie database experimental protocol is capable of gathering, disseminating, and using ratings from some users to Accumulate the no of times predict other users'interest in movies. Figure 3 shows appear for these genres the architecture of our system that can be deployed as online hybrid CF application implemented in GA gorithm. The process is divided into three phases Phase 1: Collect User Information For a new visitor, the system requests him to register an account as to collect his personal particulars as well Find maximum no. of genres as the ratings of a set of movies. Each time the user has watched or purchased a movie, he would be asked to Update "Prefer Film Type ate how good it is. Phase ll: Create User Profile features Figure 4. Flow chart of creating a feature For those movies the user rated we search for the corresponding genres and characters from the movie Phase III: GA Recommender database. In addition to using users' ratings, we try to This phase uses Ga functions to search for the e apture what types of items are most preferred by the appropriate recommendation by selecting and weighing ser. In the case of movie recommendation. the the features. Figure 5 shows the workflow of the preferred types of items are the movie genres such as recommender that the best subset can be gained from the following, Preferred Film Type, Preferred Director, the full data set by processing Feature Selection Preferred Actress, Preferred Actor, Preferred Producer, Feature Weighting and Recommendation Generation Preferred Writer, Preferred Editor and Preferred Language They form the additional 8 features to be stored in the FullDataSet ser profile(c f. Figure 2), which will be coded into the GA chromosome Accuracy abstract of what the user pree tings alone, an explicit feature weight that helps in the recommendation search process. This is particularly effective when the ratin records were in scarce during the early stage This is how it works on extracting the 8 features ith respect to each user, we sum up the total number (frequency) appears on each genre, the one with Algorithm maximum number is assumed to be most favoured by the user and marked as Prefer_ Film Type. If the sam igure 5. Workflow of GA Recommender lumber of genres exists, we would get the latest one as we suppose that the last record reflects the latest user Feature selection is the process that chooses an preference. A flowchart that shows how optimal subset of features according to a certain Prefer_ FiIm Type is computed is shown in Figure 4. The criterion. as there are thousands of features in the same logic applies to the other features can reduce the dimensionality. For example, in a users record, the most frequent eliminate noise and save search time ype is Musical; then we assume that the preferred film The first step in the GA Recommender phase is to ype of this user is Musical. In another example, if most prepare the features that are needed. For hybrid GA- of the movies that the user watched are found to be based CF. all the 37 features should be included directed by"Steven Spielberg,, then we assume the However, from our evaluation experiments, we found preferred director of the user is'Steven Spielberg that some of the user profile attributes are more This concept of extracting user preferences from the effective than the others. Good quality of majority of items that he frequented, should be generic recommendation can be produced by using a minimum enough to apply ther recommend 12 features out of ning, Ag han movie advisor Prefer FilmType, Prefer Director, Prefer Actress, Prefer
2.2. GA Recommender System Architecture We constructed a GA-based Recommendation System as a prototype for conducting experiments. The experimental protocol is capable of gathering, disseminating, and using ratings from some users to predict other users' interest in movies. Figure 3 shows the architecture of our system that can be deployed as an online hybrid CF application implemented in GA algorithm. The process is divided into three phases: • Phase I: Collect User Information For a new visitor, the system requests him to register an account as to collect his personal particulars as well as the ratings of a set of movies. Each time the user has watched or purchased a movie, he would be asked to rate how good it is. • Phase II: Create User Profile features For those movies the user rated, we search for the corresponding genres and characters from the movie database. In addition to using users’ ratings, we try to capture what types of items are most preferred by the user. In the case of movie recommendation, the preferred types of items are the movie genres such as the following, Preferred Film Type, Preferred Director, Preferred Actress, Preferred Actor, Preferred Producer, Preferred Writer, Preferred Editor and Preferred Language. They form the additional 8 features to be stored in the user profile (c.f. Figure 2), which will be coded into the GA chromosome. Instead of relying on users’ ratings alone, an explicit abstract of what the user prefers often offers a good feature weight that helps in the recommendation search process. This is particularly effective when the rating records were in scarce during the early stage. This is how it works on extracting the 8 features: with respect to each user, we sum up the total number (frequency) appears on each genre, the one with maximum number is assumed to be most favoured by the user and marked as Prefer_FilmType. If the same number of genres exists, we would get the latest one as we suppose that the last record reflects the latest user preference. A flowchart that shows how Prefer_FilmType is computed is shown in Figure 4. The same logic applies to the other features. For example, in a user’s record, the most frequent type is Musical; then we assume that the preferred film type of this user is Musical. In another example, if most of the movies that the user watched are found to be directed by ‘Steven Spielberg’, then we assume the preferred director of the user is ‘Steven Spielberg’. This concept of extracting user preferences from the majority of items that he frequented, should be generic enough to apply on other recommendation domains than movie advisor. Figure 4. Flow chart of creating a feature • Phase III: GA Recommender This phase uses GA functions to search for the appropriate recommendation by selecting and weighing the features. Figure 5 shows the workflow of the recommender that the best subset can be gained from the full data set by processing Feature Selection, Feature Weighting and Recommendation Generation. Figure 5. Workflow of GA Recommender Feature selection is the process that chooses an optimal subset of features according to a certain criterion. As there are thousands of features in the database, selection can reduce the dimensionality, eliminate noise and save search time. The first step in the GA Recommender phase is to prepare the features that are needed. For hybrid GAbased CF, all the 37 features should be included. However, from our evaluation experiments, we found that some of the user profile attributes are more effective than the others. Good quality of recommendation can be produced by using a minimum 12 features out of 37: Rating, Age, Gender, Occupation, Prefer FilmType, Prefer Director, Prefer Actress, Prefer
Actor, Prefer Producer, Prefer Writer, Prefer Editor and vaif is the value of feature f on movie i for user a Prefer Language. Feature weighting is the next step after feature Sum of the weights should be =∑w= where n selection. Each gene in the chromosome is represented represents the total number of features: W represents by a feature weight w in real value. The heavier the sum of all the weights in a chromosome. As the calculation goes, infeasible chromosomes will appear. weight the more important the feature is, so that the To solve this problem, we applied the Repair example, the weight of the feature Prefer Director is the Algorithm by Okan Yilmaz [11] highest, that indicates the user favors over his choice of movies by certain movie directors. We programmed in 3. Experiments GALib (lancet. mit. edu/ga) to find out the feature weight w, distance measure d and obtain a group of We want to evaluate the effectiveness of our ga neighbor set for the active user by choosing half of the based recommender, by comparing the traditional top scores from the users of similar taste Collaborative Filtering method using Pear After we obtained the neighbor set, we can provide ficient and our proposed hybrid schemes using the active user a list of recommendations by Genetic Algorithm. We repeated the experiments by summarizing the neighbor's rated movies which have using different sets of features as to show how and not been rated by the active user. Also we can predict which feature weights affect the fitness and the votes of those movies. Predicting of the votes helps performance guiding the user to choose his favorites by ranking the votes. The vote for movie i for active user a can be 3. 1. Experiment I- Features Weights predict_vote(a, i)=v,+ euclidean(a, j)v1-w) 10 users to test the importance of features to each user We calculated the weights for each feature. and took here. Va is the mean vote of active user a In Figure 7, we show the average weights for each k is the normalizing factor feature on different users. The average feature weights n is the size of neighbor set are about 0. 2649. The separation of two groups of lines V,I is the actual vote of neighbor j on movie i evident that the weights of user profile features are As we can also measure the predict vote for those much higher than the other movie attrit novies that the active user rated before. we can the features prefer director, prefer actress, prefer actor, compare it with the actual vote to cross-check the prefer producer, prefer writer, prefer editor: all of their fitness rate for user a on movie i. feature weights are over 0.3. By this observation, we te(a,i)×100% confirm that rating, age, gender, occupation, prefer film producer, prefer writer, prefer editor are more relevant to the user preference than the other features 2.3. Similarity Measure 3. 2. Experiment I- Fitness Accuracy For obtaining the neighbor set of the active user, similarity measure is used as to reflect the affinity For testing the fitness accuracy of Pearson between users. Firstly, we collect those users who Algorithm and Genetic Algorithm for CF, and also the shared common movies with the active user. Then the effectiveness of different features on GA, we distance measure d and feature weight w between the configured a variety of five different types of active user and his neighbors can be found by the chromosomes in this experiment and applied them on Euclidean distance function 50 random users respectively. The chromosome feature d(a,)= compositions of the five types are shown in Table 2 The different combinations represent on how different a is the active user, j is the neighbor, and a tj types of filtering as shown in Table 1 can be made up i is the common movie between a and j by including various input data(such as Item Contents is the total number of common movies User Ratings and User Demographics) f is the feature number One sample chromosome with combination of 7 n is the total number of features features that is constructed by preprocessing the user wris the weight of feature f for user a profile data pertaining to the movie attributes is shown
Actor, Prefer Producer, Prefer Writer, Prefer Editor and Prefer Language. Feature weighting is the next step after feature selection. Each gene in the chromosome is represented by a feature weight w in real value. The heavier the weight the more important the feature is, so that the value can reflect the feature importance to the user. For example, the weight of the feature Prefer Director is the highest, that indicates the user favors over his choice of movies by certain movie directors. We programmed in GALib (lancet.mit.edu/ga) to find out the feature weight w, distance measure d and obtain a group of neighbor set for the active user by choosing half of the top scores from the users of similar taste. After we obtained the neighbor set, we can provide the active user a list of recommendations by summarizing the neighbor’s rated movies which have not been rated by the active user. Also we can predict the votes of those movies. Predicting of the votes helps guiding the user to choose his favorites by ranking the votes. The vote for movie i for active user a can be predicted by: ∑ = = + − n j predict vote a i v a k euclidean a j vj i vj 1 _ ( ), ( , )( , ) where: v a is the mean vote of active user a k is the normalizing factor n is the size of neighbor set vj , i is the actual vote of neighbor j on movie i As we can also measure the predict vote for those movies that the active user rated before, we can compare it with the actual vote to cross-check the fitness rate for user a on movie i: _ ( , ) _ ( , ) 100% ( , ) actual vote a i predict vote a i fitness a i × = 2.3. Similarity Measure For obtaining the neighbor set of the active user, similarity measure is used as to reflect the affinity between users. Firstly, we collect those users who shared common movies with the active user. Then the distance measure d and feature weight w between the active user and his neighbors can be found by the Euclidean Distance function: ∑ ∑ ( ) = = = − z i n f d a j wf va i f vj i f 1 2 1 ( , ) , , , , where a is the active user, j is the neighbor, and a ≠ j i is the common movie between a and j z is the total number of common movies f is the feature number n is the total number of features wf is the weight of feature f for user a va,i,f is the value of feature f on movie i for user a. Sum of the weights should be ∑= = = n i W wi 1 1 where n represents the total number of features; W represents the sum of all the weights in a chromosome. As the calculation goes, infeasible chromosomes will appear. To solve this problem, we applied the Repair Algorithm by Okan Yilmaz [11]. 3. Experiments We want to evaluate the effectiveness of our GAbased recommender, by comparing the traditional Collaborative Filtering method using Pearson Coefficient and our proposed hybrid schemes using Genetic Algorithm. We repeated the experiments by using different sets of features as to show how and which feature weights affect the fitness and performance. 3.1. Experiment I - Features Weights We calculated the weights for each feature, and took 10 users to test the importance of features to each user. In Figure 7, we show the average weights for each feature on different users. The average feature weights are about 0.2649. The separation of two groups of lines is evident that the weights of user profile features are much higher than the other movie attributes. Especially the features prefer director, prefer actress, prefer actor, prefer producer, prefer writer, prefer editor; all of their feature weights are over 0.3. By this observation, we confirm that rating, age, gender, occupation, prefer film type, prefer director, prefer actress, prefer actor, prefer producer, prefer writer, prefer editor are more relevant to the user preference than the other features. 3.2. Experiment II - Fitness Accuracy For testing the fitness accuracy of Pearson Algorithm and Genetic Algorithm for CF, and also the effectiveness of different features on GA, we configured a variety of five different types of chromosomes in this experiment and applied them on 50 random users respectively. The chromosome feature compositions of the five types are shown in Table 2. The different combinations represent on how different types of filtering as shown in Table 1 can be made up by including various input data (such as Item Contents, User Ratings and User Demographics). One sample chromosome with combination of 7 features that is constructed by preprocessing the user profile data pertaining to the movie attributes is shown in Figure 6
From Figure 8, we can observe apparently that the recommendation than that of Pearson Algorithm. By accuracy of prediction of GA is much higher than that applying user profile features that are more significant of pearson The average fitness of pearson is abou than other features such as movie genres on GA, the 69.2%, whereas the average fitness for Ga with various similarity measure finds the neighbors with similar combinations of features ranges from 81.28% to taste to the user; as a result, the user preference can be 81.77%. In particular, the fitness of ' GA UserPro12 is better predicted. And when user profile features are the highest, about 18.21% better than that of Pearson used alone, the process speeds up. Our experiment also C Figure 9 shows the processing time for running the shows the GA fitness keeps constant when neighbor set algorithm on different features. ' GA UserPro7'that size increases. This implies a positive prospect in the is Ga with seven features out-performs the other four scalability and speed issues of GA recommender in terms of speed. Its average processing time is 19 seconds. This is a 67.53% reduction over GA with 22 5. References reatures which implements Content Analysis. The longest time taken is by GA UserPro37'and [11 R Meteren, M Someren. "Using Content-based the shortest time is 'GA UserPro7,. this reinforces the Filtering for Recommendation", Proceedings of MLnet belief that when more features are integrated into the /ECML2000 Workshop, 2000 recommender a longer processing time it takes [21 K. Iwahama, Y Hijikata, S Nishida, ""Content-based Filtering System for Music Data", Proceedings of 3.3. Experiment Ill- Neighbor Set 2004 Intenational Symposium on Applications and the Internet Workshops, pp. 480-487, 2004 We tested the performance on different group sizes [3] B Sarwar, G Karypis, J. Konston and JRiedl,"Item- of neighbor set, from 10 to 100 respectively Based Collaborative Filtering Recommendation 3.3.1. Process time vS. Neighbor Set. Figure 10 Algorithms", Proceedings of the 10th shows the performance of different features running on Conference on World Wide Web, pp 285-295, 2001 Ga with different sizes of neighbor sets. At the [41 M. Deshpande G Karypis. " Item-Based Top-N beginning, their performances are close. As the ecommendation Algorithms "ACM Transactions or neighbor set size s. the Information Systems, Volume 22. Issue 1 (January GA, GA Merge,‘ GA UserPro37’ Increase quite 2004),pp.l43-177,2004 sharply. The additional features they have in common 5I S. Ujjin, J.P. Bentley,"Particle swarm optimization are the 18 movie genres recommender system. IEEE International conference As we can see. GA UserPro37' has 37 features. the process time increases gradually as the neighbor set [6] G Xue, C Lin, Q Yang, w.Xi, H Zeng, Y Yu and xpands; whereas for "GA UserPro7 with 7 features, Z Chen. "Scalable Collaborative Filtering using the process time increases slowly 3.3.2. Fitness vs. Neighbor Set. In Figure ll, the Conference on Research and Development in fitness of different features rise up gradually as the Information Retrieval SIGIR 05, pp 114-121, 2005 neighbor set size expands. Interestingly when th [71 Y Shih, D Liu, "Hybrid Re neighbor set reaches over the size of 35. the fitness Approaches: Collaborative Filtering via Valuable continues to stay constant. As indicated by the dotted Content Information, Proceedings of the 38th Annual Hawaii International Conference on System Sciences, line in the chart, the fitness approaches at Track & Volume 08, pp 217-220. 2005 approximately 95% at the turning point. Further increase on the neighbor set size has no effect. [81 M. Claypool, A Gokhale, T. Miranda, Combining Content-Based and Collaborative filters in an Online Newspaper", ACM SIGIR Workshop on Recommender 4. Conclusion Systems, August 19, 1999, USA [91 E Han, G. Karypis, " Feature-Based Recommendation We proposed and evaluated using Gafor System", In Proceeding of ACM CIKM'05 pp 446. supporting hybrid CF to recommend new items to a 452, October 2005, Bremen, Germany particular user based on his previous likings or the [101 S Ujin, P.J. Bentley, "Learning User Preferences pinions of other like-minded users. GA-based CI Using Evolution", 4th Asia-Pacific Conference on provides reasonably good recommendation accuracy. simulated Evolution and Learning, Singapore. The performance can be further improved by [111 O. Yilmaz, L. Tuaf, "Data Mining Feature Subset ncorporating user profile features in the chromosomes. Weighting and Selection using genetic algorithms As our experiments show, Ga offers more accurate Thesis, Air Force Institute of Technology, pp 55-
From Figure 8, we can observe apparently that the accuracy of prediction of GA is much higher than that of Pearson. The average fitness of Pearson is about 69.2%, whereas the average fitness for GA with various combinations of features ranges from 81.28% to 81.77%. In particular, the fitness of ‘GA UserPro12’ is the highest, about 18.21% better than that of Pearson. Figure 9 shows the processing time for running the GA algorithm on different features. ‘GA UserPro7’ that is GA with seven features out-performs the other four in terms of speed. Its average processing time is 19 seconds. This is a 67.53% reduction over GA with 22 features which implements Content Analysis. The longest time taken is by ‘GA UserPro37’ and the shortest time is ‘GA UserPro7’. This reinforces the belief that when more features are integrated into the recommender a longer processing time it takes. 3.3. Experiment III - Neighbor Set We tested the performance on different group sizes of neighbor set, from 10 to 100 respectively. 3.3.1. Process time vs. Neighbor Set. Figure 10 shows the performance of different features running on GA with different sizes of neighbor sets. At the beginning, their performances are close. As the neighbor set size increases, the processing times for ‘GA’, ‘GA Merge’, ‘GA UserPro37’ increase quite sharply. The additional features they have in common are the 18 movie genres. As we can see, ‘GA UserPro37’ has 37 features, the process time increases gradually as the neighbor set expands; whereas for ‘GA UserPro7’ with 7 features, the process time increases slowly. 3.3.2. Fitness vs. Neighbor Set. In Figure 11, the fitness of different features rise up gradually as the neighbor set size expands. Interestingly when the neighbor set reaches over the size of 35, the fitness continues to stay constant. As indicated by the dotted line in the chart, the fitness approaches at approximately 95% at the turning point. Further increase on the neighbor set size has no effect. 4. Conclusion We proposed and evaluated using GA for supporting hybrid CF to recommend new items to a particular user based on his previous likings or the opinions of other like-minded users. GA-based CF provides reasonably good recommendation accuracy. The performance can be further improved by incorporating user profile features in the chromosomes. As our experiments show, GA offers more accurate recommendation than that of Pearson Algorithm. By applying user profile features that are more significant than other features such as movie genres on GA, the similarity measure finds the neighbors with similar taste to the user; as a result, the user preference can be better predicted. And when user profile features are used alone, the process speeds up. Our experiment also shows the GA fitness keeps constant when neighbor set size increases. This implies a positive prospect in the scalability and speed issues of GA recommender. 5. References [1] R. Meteren, M. Someren, “Using Content-based Filtering for Recommendation”, Proceedings of MLnet / ECML2000 Workshop, 2000 [2] K. Iwahama, Y. Hijikata, S. Nishida, “Content-based Filtering System for Music Data”, Proceedings of the 2004 International Symposium on Applications and the Internet Workshops, pp. 480 - 487, 2004 [3] B. Sarwar, G. Karypis, J. Konston and J. Riedl, “ItemBased Collaborative Filtering Recommendation Algorithms”, Proceedings of the 10th International Conference on World Wide Web, pp. 285 - 295, 2001 [4] M. Deshpande, G. Karypis, “Item-Based Top-N Recommendation Algorithms”, ACM Transactions on Information Systems, Volume 22, Issue 1 (January 2004), pp. 143 – 177, 2004. [5] S. Ujjin, J. P. Bentley, “Particle swarm optimization recommender system”, IEEE International conference on Evolutionary Computation, pp. 124-131, 2003 [6] G. Xue, C. Lin, Q. Yang, W. Xi, H. Zeng, Y. Yu and Z. Chen, “Scalable Collaborative Filtering Using Cluster-based Smoothing”, International ACM SIGIR Conference on Research and Development in Information Retrieval SIGIR ’05, pp. 114 – 121, 2005 [7] Y. Shih, D. Liu, “Hybrid Recommendation Approaches: Collaborative Filtering via Valuable Content Information”, Proceedings of the 38th Annual Hawaii International Conference on System Sciences, Track 8 - Volume 08, pp. 217-220, 2005 [8] M. Claypool, A. Gokhale, T. Miranda, “Combining Content-Based and Collaborative Filters in an Online Newspaper”, ACM SIGIR Workshop on Recommender Systems, August 19, 1999, USA [9] E. Han, G. Karypis, “Feature-Based Recommendation System”, In Proceeding of ACM CIKM'05, pp. 446- 452, October 2005, Bremen, Germany [10] S. Ujjin, P. J. Bentley, “Learning User Preferences Using Evolution”, 4th Asia-Pacific Conference on simulated Evolution and Learning, Singapore. [11] O. Yilmaz, L. Tuaf, “Data Mining Feature Subset Weighting and Selection using Genetic Algorithms”, Thesis, Air Force Institute of Technology, pp. 55-59
Table 2. Compositions of chromosome types Chromosome featu Item Contents Chromosome Type RatingDemographics.Preferences es)( 8 features)I 18 genres from MovieLens ures(GA User Profile 7) tic Algorithm with 12 features(GA User Profile 12) I Genetic Algorithm with 29 features( GA Merge) I Genetic Algorithm with 37 features(GA User Profile 37) Average Feature Weights of Neighbor Set 幸辛 Documentary igure 7 Feature Weights Chart of Neighbor Set Random 50 Users 50 Neighbors →0AM群9 0点U5Pro3 点 UspTo12 =点U3Pro Figure 8. Fitness Accuracy Chart for random 50 Users
Table 2. Compositions of chromosome types Chromosome Features Item Contents Chromosome Type User Rating User Demographics (3 features) User Preferences (8 features) 18 genres from MovieLens 7 characters from IMDb Pearson Algorithm (Pearson) Genetic Algorithm with 7 features (GA User Profile 7) Genetic Algorithm with 12 features (GA User Profile 12) Genetic Algorithm with 22 features (GA) Genetic Algorithm with 29 features (GA Merge) Genetic Algorithm with 37 features (GA User Profile 37) Figure 7. Feature Weights Chart of Neighbor Set Figure 8. Fitness Accuracy Chart for random 50 Users
dom 50 Users 50 Neigh aGA -GA User Pra37 -A-GA UserPro7 Actve User Figure 9. Performance Chart for Random 50 Users Process Time vs Neighbor Set 890E=W8 Figure 10. Processing Time verse Neighbor Set Chart Fitness vs Neighbor set igbor Set Figure 11. Fitness verse Neighbor Set Chart
Figure 9. Performance Chart for Random 50 Users Figure 10. Processing Time verse Neighbor Set Chart Figure 11. Fitness verse Neighbor Set Chart