Application of Dimensionality Reduction in Recommender System-A Case Study John t riedl Department of Computer Science and Engineering Army HPC Research Center University of Minnesota +1612625-4002 [sarwar, karypis, konstan, riedl]@cs. umn. edu meet many of the challenges of recommender Abstr We investigate the use of dimensionality reduction to I Introduction improve performance for a new class of data analysis software called "recommender systems Recommender systems apply knowledge discovery Recommender systems have evolved extremely interactive environment of the Web techniques to the problem of making product apply data recommendations during a live customer interaction analysis techniques to the problem of helping These systems are achieving widespread success in customers find which products they would like to E-commerce nowadays, especially with the advent of purchase at E-Commerce sites. For instance,a the Internet. The tremendous growth of customers recommender Amazon.com and products poses three key challenges for (www.amazoncom)suggestsbookstocustomers recommender systems in the E-commerce domain based on other books the customers have told These are: producing high quality recommendations, Amazon they like. another recommender syst performing many recommendations per second for CdnoW(www.cdnow.com)helpscustomerschoose millions of customers and products, and achieving CDs to purchase as gifts, based on other CDs the high coverage in the face of data sparsity. One recipient has liked in the past. In a sense, uccessful recommender system technology is recommender systems are an application of a collaborative filtering, which works by matching articular type of Knowledge Discovery in Databases ustomer preferences to other customers in making (KDD)(Fayyad et al. 1996)technique. KDD recommendations. Collaborative filtering has been systems use many subtle data analysis techniques to shown to produce high quality recommendations, but achieve two unsubtle goals. They are: (to save he performance degrades with the number money by discovering the potential for efficiencies, customers and products. New recommender system or(ii) to make more money by discovering ways to technologies are needed that can quickly prod sell more products to customers. For instance high quality recommendations, even for very large- companies are using KDd to discover which scale problems products sell well at which times of year, so they can manage their retail store inventory more efficiently This paper presents two different experiments where potentially saving millions of dollars ayear we have explored one technology called Singular (Brachman et al. 1996). Other companies are using Value Decomposition (SVD) to reduce the kdd to discover which customers will be most dimensionality of recommender system databases interested in a special offer, reducing the costs of Each experiment compares the quality of a direct mail or outbound telephone campaigns by recommender system using SVD with the quality of a hundreds of thousands of dollars a year recommender system using collaborative filtering (Bhattacharyya 1998, Ling et al. 1998). These he first experiment compares the effectiveness of pplications typically involve using Kdd to discover the two recommender systems at predicting consumer a new model, and having an analyst apply the model preferences based on a database of explicit ratings of to the application. However, the most direct benefit products. The second experiment compares the of Kdd to businesses is increasing sales of existing effectiveness of the two recommender systems at products by matching customers to the products they producing Top-N lists based on a real-life customer will be most likely to purchase. The Web presents urchase database from an E-Commerce site. Our new opportunities for KDD, but challenges KDD experience suggests that SVd has the potential to ystems to perform interactively. While a customer
Application of Dimensionality Reduction in Recommender System -- A Case Study Badrul M. Sarwar, George Karypis, Joseph A. Konstan, John T. Riedl Department of Computer Science and Engineering / Army HPC Research Center University of Minnesota Minneapolis, MN 55455 +1 612 625-4002 {sarwar, karypis, konstan, riedl}@cs.umn.edu Abstract We investigate the use of dimensionality reduction to improve performance for a new class of data analysis software called “recommender systems”. Recommender systems apply knowledge discovery techniques to the problem of making product recommendations during a live customer interaction. These systems are achieving widespread success in E-commerce nowadays, especially with the advent of the Internet. The tremendous growth of customers and products poses three key challenges for recommender systems in the E-commerce domain. These are: producing high quality recommendations, performing many recommendations per second for millions of customers and products, and achieving high coverage in the face of data sparsity. One successful recommender system technology is collaborative filtering, which works by matching customer preferences to other customers in making recommendations. Collaborative filtering has been shown to produce high quality recommendations, but the performance degrades with the number of customers and products. New recommender system technologies are needed that can quickly produce high quality recommendations, even for very largescale problems. This paper presents two different experiments where we have explored one technology called Singular Value Decomposition (SVD) to reduce the dimensionality of recommender system databases. Each experiment compares the quality of a recommender system using SVD with the quality of a recommender system using collaborative filtering. The first experiment compares the effectiveness of the two recommender systems at predicting consumer preferences based on a database of explicit ratings of products. The second experiment compares the effectiveness of the two recommender systems at producing Top-N lists based on a real-life customer purchase database from an E-Commerce site. Our experience suggests that SVD has the potential to meet many of the challenges of recommender systems, under certain conditions. 1 Introduction Recommender systems have evolved in the extremely interactive environment of the Web. They apply data analysis techniques to the problem of helping customers find which products they would like to purchase at E-Commerce sites. For instance, a recommender system on Amazon.com (www.amazon.com) suggests books to customers based on other books the customers have told Amazon they like. Another recommender system on CDnow (www.cdnow.com) helps customers choose CDs to purchase as gifts, based on other CDs the recipient has liked in the past. In a sense, recommender systems are an application of a particular type of Knowledge Discovery in Databases (KDD) (Fayyad et al. 1996) technique. KDD systems use many subtle data analysis techniques to achieve two unsubtle goals. They are: (i) to save money by discovering the potential for efficiencies, or (ii) to make more money by discovering ways to sell more products to customers. For instance, companies are using KDD to discover which products sell well at which times of year, so they can manage their retail store inventory more efficiently, potentially saving millions of dollars a year (Brachman et al. 1996). Other companies are using KDD to discover which customers will be most interested in a special offer, reducing the costs of direct mail or outbound telephone campaigns by hundreds of thousands of dollars a year (Bhattacharyya 1998, Ling et al. 1998). These applications typically involve using KDD to discover a new model, and having an analyst apply the model to the application. However, the most direct benefit of KDD to businesses is increasing sales of existing products by matching customers to the products they will be most likely to purchase. The Web presents new opportunities for KDD, but challenges KDD systems to perform interactively. While a customer
is at the E-Commerce site, the recommender system depicts the neighborhood formation using a nearest nust learn from the customers behavior, develop a neighbor technique in a very simple two dimensional model of that behavior, and apply that model to pace. Notice that each user' s neighborhood is those recommend products to the customer. Recommender other users who are most similar to him as identified systems directly realize this benefit of KDD systems by the proximity measure. Neighborhoods need not n E-Commerce. They help consumers find the be symmetric. Each user has the best neighborhood products they wish to buy at the E-Commerce site for him. Once a neighborhood of users is found Collaborative filtering is the most successful particular products can be evaluated by forming a recommender system technology to date, and is used weighted composite of the neighbors' opinions of in many of the most successful recommender systems on the Web. including those at Amazon. com and CDnow. com These statistical approaches, known as automated collaborative filtering, typically rely upon ratings as The earliest implementations of collaborative numerical expressions of user preference. Several filtering, in systems such as Tapestry( Goldberg et ratings-based automated collaborative filtering al., 1992), relied on the opinions of people from a systems have been developed. The GroupLens close-knit community, such as an office workgroup Research system(Resnick et al. 1994) provides a collaborative filtering for large pseudonymous collaborative filtering solution for target us Illustration of the neighborhood formation process. The distance between the ser and every other user is computed and the closest-k users are chosen as the agram k=5) communities cannot depend on each person knowing Usenet news and movies. Ringo( Shardanand et al the others. Several systems use statistical technique 1995)and video Recommender(Hill et al. 1995)are to provide personal recommendations of documents email and generat by finding a group of other users, known as recommendations on music and movies respectively neighbors that have a history of agreeing with the Here we present the schematic diagram of the target user. Usually, neighborhoods are formed by architecture of the GroupLens Research collaborative Sys Ratin Ratings h Www HTMI ResponseServer generator Recomm- Figure 2. Recommender System Architecture Database applying proximity measures such as the Pearson filtering engine in figure 2. The user interacts with a correlation between the opinions of the users. These Web interface. The Web server software are called nearest-neighbor techniques. Figure 1 communicates with the recommender system to
is at the E-Commerce site, the recommender system must learn from the customer’s behavior, develop a model of that behavior, and apply that model to recommend products to the customer. Recommender systems directly realize this benefit of KDD systems in E-Commerce. They help consumers find the products they wish to buy at the E-Commerce site. Collaborative filtering is the most successful recommender system technology to date, and is used in many of the most successful recommender systems on the Web, including those at Amazon.com and CDnow.com. The earliest implementations of collaborative filtering, in systems such as Tapestry (Goldberg et al., 1992), relied on the opinions of people from a close-knit community, such as an office workgroup. However, collaborative filtering for large communities cannot depend on each person knowing the others. Several systems use statistical techniques to provide personal recommendations of documents by finding a group of other users, known as neighbors that have a history of agreeing with the target user. Usually, neighborhoods are formed by applying proximity measures such as the Pearson correlation between the opinions of the users. These are called nearest-neighbor techniques. Figure 1 depicts the neighborhood formation using a nearestneighbor technique in a very simple two dimensional space. Notice that each user’s neighborhood is those other users who are most similar to him, as identified by the proximity measure. Neighborhoods need not be symmetric. Each user has the best neighborhood for him. Once a neighborhood of users is found, particular products can be evaluated by forming a weighted composite of the neighbors’ opinions of that document. These statistical approaches, known as automated collaborative filtering, typically rely upon ratings as numerical expressions of user preference. Several ratings-based automated collaborative filtering systems have been developed. The GroupLens Research system (Resnick et al. 1994) provides a pseudonymous collaborative filtering solution for Usenet news and movies. Ringo (Shardanand et al. 1995) and Video Recommender (Hill et al. 1995) are email and web systems that generate recommendations on music and movies respectively. Here we present the schematic diagram of the architecture of the GroupLens Research collaborative filtering engine in figure 2. The user interacts with a Web interface. The Web server software communicates with the recommender system to 1 2 5 3 4 Figure 1: Illustration of the neighborhood formation process. The distance between the target user and every other user is computed and the closest-k users are chosen as the neighbors (for this diagram k = 5). Recommender System Customer Dynamic HTML generator WWW Server Recommendations Respons e Reques t Correlation Database Ratings Database Ratings Ratings Recommendations Figure 2. Recommender System Architecture
choose products to suggest to the user. The 2 Existing Recommender Systems recommender system, in this case a collaborative Approaches and their limitations filtering system, uses its database of ratings of products to form neighborhoods and make recommendations. The Web server software displays Most collaborative filtering based recommender the recommended products to the user build a neighborhood of likeminded customers. The Neighborhood formation scheme The largest Web sites operate at a scale that stresses usually uses Pearson correlation or cosine similarit the direct implementation of collaborative filtering as a measure of proximity(Shardanand et al. 1995. Model-based techniques(Fayyad et al., 1996)have Resnick et al. 1994). Once these systems determine the potential to contribute to recommender systems the proximity neighborhood they produce two types that can operate at the scale of these sites. However, of recommendations these techniques must be adapted to the real-time needs of the Web, and they must be tested in realistic 1. Prediction of how much a customer c will like a problems derived from Web access patterns. The product P. In case of correlation based present paper describes our experimental results in algorithm, prediction on product 'P for applying a model-based technique, Latent Semantic customer C' is computed by computing a weighted sum of co-rated items between C and Indexing(LSD), that uses a dimensionality reduction all his neighbors and then by adding C's average our recommender system. We use two data sets in rating to that. This can be expressed by the following formula(Resnick et al., 1994) our experiments to test the performance of the model based technique: a movie dataset and an e-commerce The contributions of this paper are 1. The details of how one model-based Here, rcy denotes the correlation between user C echnology, LSI/SVD, was applied to and neighbor J. JP is J's ratings on product P reduce dimensionality in recommende J and C are J and C's average ratings. The systems for generating predictions prediction is personalized for the customer C There are. however. some naive non 2. Using low dimensional representation to compute neighborhood for generating personalized prediction schemes where prediction, for example, is computed simply by recommendations taking the average ratings of items being 3. The results of our experiments with LSIISVd on two test data sets-our 2. Recommendation of a list of products for a Movielens test-bed and customer customer C. This is commonly known as top-M product purchase data from a large E recommendation. Once a neighborhood is commerce company ormed, the recommender system algorithm focuses on the products rated by the neighbors The rest of the paper is organized as follows. The and selects a list of N products that will be liked next section describes some potential problems by the customer associated with correlation-based collaborative These systems have been successful in several filtering models. Section 3 explores the possibilities of leveraging the latent semantic relationship in domains, but the algorithm is reported to have shown some limitations such as. customer-product matrix as a basis for prediction generation. At the same time it explains how we can Sparsity: Nearest neighbor algorithms rely upon take the advantage of reduced dimensionality to form exact matches that cause the algorithms to better neighborhood of customers. The section sacrifice recommende overage and following that delineates our experimental test-bed accuracy(Konstan et al., 1997. Sarwar et al experimental design, results and discussion about the 998). In particular, since the correlation improvement in quality and performance. Section 5 coefficient is only defined between customers oncludes the paper and provides directions for future who have rated at least two products in common many pairs of customers have no correlation at all (Billsus et al., 1998). In practice, many commercial recommender systems are used to
choose products to suggest to the user. The recommender system, in this case a collaborative filtering system, uses its database of ratings of products to form neighborhoods and make recommendations. The Web server software displays the recommended products to the user. The largest Web sites operate at a scale that stresses the direct implementation of collaborative filtering. Model-based techniques (Fayyad et al., 1996) have the potential to contribute to recommender systems that can operate at the scale of these sites. However, these techniques must be adapted to the real-time needs of the Web, and they must be tested in realistic problems derived from Web access patterns. The present paper describes our experimental results in applying a model-based technique, Latent Semantic Indexing (LSI), that uses a dimensionality reduction technique, Singular Value Decomposition (SVD), to our recommender system. We use two data sets in our experiments to test the performance of the modelbased technique: a movie dataset and an e-commerce dataset. The contributions of this paper are: 1. The details of how one model-based technology, LSI/SVD, was applied to reduce dimensionality in recommender systems for generating predictions. 2. Using low dimensional representation to compute neighborhood for generating recommendations. 3. The results of our experiments with LSI/SVD on two test data sets—our MovieLens test-bed and customerproduct purchase data from a large Ecommerce company. The rest of the paper is organized as follows. The next section describes some potential problems associated with correlation-based collaborative filtering models. Section 3 explores the possibilities of leveraging the latent semantic relationship in customer-product matrix as a basis for prediction generation. At the same time it explains how we can take the advantage of reduced dimensionality to form better neighborhood of customers. The section following that delineates our experimental test-bed, experimental design, results and discussion about the improvement in quality and performance. Section 5 concludes the paper and provides directions for future research. 2 Existing Recommender Systems Approaches and their Limitations Most collaborative filtering based recommender systems build a neighborhood of likeminded customers. The Neighborhood formation scheme usually uses Pearson correlation or cosine similarity as a measure of proximity (Shardanand et al. 1995, Resnick et al. 1994). Once these systems determine the proximity neighborhood they produce two types of recommendations. 1. Prediction of how much a customer C will like a product P. In case of correlation based algorithm, prediction on product ‘P’ for customer ‘C’ is computed by computing a weighted sum of co-rated items between C and all his neighbors and then by adding C's average rating to that. This can be expressed by the following formula (Resnick et al., 1994): å å Î - = + J CJ J rates CJ r J J r C pred C ( ) P P Here, rCJ denotes the correlation between user C and neighbor J. JP is J's ratings on product P. J and C are J and C's average ratings. The prediction is personalized for the customer C. There are, however, some naive nonpersonalized prediction schemes where prediction, for example, is computed simply by taking the average ratings of items being predicted over all users (Herlocker et al., 1999). 2. Recommendation of a list of products for a customer C. This is commonly known as top-N recommendation. Once a neighborhood is formed, the recommender system algorithm focuses on the products rated by the neighbors and selects a list of N products that will be liked by the customer. These systems have been successful in several domains, but the algorithm is reported to have shown some limitations, such as: · Sparsity: Nearest neighbor algorithms rely upon exact matches that cause the algorithms to sacrifice recommender system coverage and accuracy (Konstan et al., 1997. Sarwar et al., 1998). In particular, since the correlation coefficient is only defined between customers who have rated at least two products in common, many pairs of customers have no correlation at all (Billsus et al., 1998). In practice, many commercial recommender systems are used to
evaluatelargeproductsets(e.g.,amazon.com filtering agent solution, however, did not address the recommends books and CDnow recommends fundamental problem of poor relationships music albums). In these systems, even active like-minded but sparse-rating customers. We customers may have rated well under 1% of the recognized that the KDd research community had products (1% of 2 million books is 20,000 extensive experience learning from sparse databases books--a large set on which to have an After reviewing several KDD techniques, we decided Accordingly, to try applying Latent Semantic Indexing (si)to algorithms may be unable to make many reduce the dimensionality of our customer-product commendations for a particular user. This problem is known as reduced coverage, and is due to sparse ratings of neighbors. Furthermore, LSI is a dimensionality reduction technique that has the accuracy of recommendations may be poor been widely used in information retrieval (Ir)to because fairly little ratings data can be included solve the problems of synonymy and polysemy An example of a missed opportunity for quality (Deerwester et al. 1990). Given a term-document- is the loss of neighbor transitivity. If customers frequency matrix, LSI is used to construct two Paul and Sue correlate highly, and Sue also matrices of reduced dimensionality. In essence, these correlates highly with Mike, it is not necessarily matrices represent latent attributes of terms, as true that Paul and Mike will correlate. They may reflected by their occurrence in documents, and of have too few ratings in common or may even documents, as reflected by the terms that occur show a negative correlation due to a small We rying number of unusual ratings in common relationships among pairs of customers based on atings of products. By reducing the dimensionality Scalability: Nearest neighbor algorithms require of the product space, we can increase density and computation that grows with both the number of thereby find more ratings. Discovery of latent ustomers and the number of products. With relationship from the database may potentially solve millions of customers and products, a typical the synonymy problem in recommender systems web-based recommender system running LSI, which uses singular value decomposition as its existing algorithms will suffer serious scalability underlying matrix factorization algorithm, maps nicely into the collaborative filtering recommender algorithm challenge. Berry et al. (1995)point out that Synonymy: In real life scenario, different product names can refer to the similar objects the reduced orthogonal dimensions resulting from Correlation based recommender systems cant SVD are less noisy than the original data and capture the latent associations between the terms and find this latent association and treat these products differently. For example, let us consider documents. Earlier work(Billsus et al. 1998)took two customers one of them rates 10 different advantage of this semantic property to reduce the dimensionality of feature space. The reduced feature recycled letter pad products as high"and another customer rates 10 different recycled meno ad products " high". Correlation based predictions. The rest of this section presents the construction of SVD-based recommender algorithm recommender systems would see no match for the purpose of generating predictions and top-M between product sets to compute correlation and would be unable to discover the latent recommendations; the following section describes association that both of them like recycled office our experimental setup, evaluation metrics, and products 3.1 Singular Value Decomposition (SVD) 3 Applying SVD for Collaborative Filtering SVD is a well-known matrix factorization technique The weakness of Pearson nearest neighbor for large, that factors an m x n matrix r into three matrices sparse databases he following recommender system algorithms. Our first approach R=U·S·V attempted to bridge the sparsity by incorporating semi-intelligent filtering agents into the system Where, U and V are two orthogonal matrices of size (Sarwar et al., 1998, Good et al., 1999). These agents m x r and n x r respectively; r is the rank of the evaluated and rated each product, using syntactic matrix R. S is a diagonal matrix of size r x r having features. By providing a dense ratings set, they all singular values of matrix R as its diagonal entries helped alleviate coverage and improved quality. The All the entries of matrix S are positive and stored in
evaluate large product sets (e.g., Amazon.com recommends books and CDnow recommends music albums). In these systems, even active customers may have rated well under 1% of the products (1% of 2 million books is 20,000 books--a large set on which to have an opinion). Accordingly, Pearson nearest neighbor algorithms may be unable to make many product recommendations for a particular user. This problem is known as reduced coverage, and is due to sparse ratings of neighbors. Furthermore, the accuracy of recommendations may be poor because fairly little ratings data can be included. An example of a missed opportunity for quality is the loss of neighbor transitivity. If customers Paul and Sue correlate highly, and Sue also correlates highly with Mike, it is not necessarily true that Paul and Mike will correlate. They may have too few ratings in common or may even show a negative correlation due to a small number of unusual ratings in common. · Scalability: Nearest neighbor algorithms require computation that grows with both the number of customers and the number of products. With millions of customers and products, a typical web-based recommender system running existing algorithms will suffer serious scalability problems. · Synonymy: In real life scenario, different product names can refer to the similar objects. Correlation based recommender systems can't find this latent association and treat these products differently. For example, let us consider two customers one of them rates 10 different recycled letter pad products as "high" and another customer rates 10 different recycled memo pad products "high". Correlation based recommender systems would see no match between product sets to compute correlation and would be unable to discover the latent association that both of them like recycled office products. 3 Applying SVD for Collaborative Filtering The weakness of Pearson nearest neighbor for large, sparse databases led us to explore alternative recommender system algorithms. Our first approach attempted to bridge the sparsity by incorporating semi-intelligent filtering agents into the system (Sarwar et al., 1998, Good et al., 1999). These agents evaluated and rated each product, using syntactic features. By providing a dense ratings set, they helped alleviate coverage and improved quality. The filtering agent solution, however, did not address the fundamental problem of poor relationships among like-minded but sparse-rating customers. We recognized that the KDD research community had extensive experience learning from sparse databases. After reviewing several KDD techniques, we decided to try applying Latent Semantic Indexing (LSI) to reduce the dimensionality of our customer-product ratings matrix. LSI is a dimensionality reduction technique that has been widely used in information retrieval (IR) to solve the problems of synonymy and polysemy (Deerwester et al. 1990). Given a term-documentfrequency matrix, LSI is used to construct two matrices of reduced dimensionality. In essence, these matrices represent latent attributes of terms, as reflected by their occurrence in documents, and of documents, as reflected by the terms that occur within them. We are trying to capture the relationships among pairs of customers based on ratings of products. By reducing the dimensionality of the product space, we can increase density and thereby find more ratings. Discovery of latent relationship from the database may potentially solve the synonymy problem in recommender systems. LSI, which uses singular value decomposition as its underlying matrix factorization algorithm, maps nicely into the collaborative filtering recommender algorithm challenge. Berry et al. (1995) point out that the reduced orthogonal dimensions resulting from SVD are less noisy than the original data and capture the latent associations between the terms and documents. Earlier work (Billsus et al. 1998) took advantage of this semantic property to reduce the dimensionality of feature space. The reduced feature space was used to train a neural network to generate predictions. The rest of this section presents the construction of SVD-based recommender algorithm for the purpose of generating predictions and top-N recommendations; the following section describes our experimental setup, evaluation metrics, and results. 3.1 Singular Value Decomposition (SVD) SVD is a well-known matrix factorization technique that factors an m ´ n matrix R into three matrices as the following: Where, U and V are two orthogonal matrices of size m ´ r and n ´ r respectively; r is the rank of the matrix R. S is a diagonal matrix of size r ´ r having all singular values of matrix R as its diagonal entries. All the entries of matrix S are positive and stored in R = U × S ×V ¢
decreasing order of their magnitude. The matrices x k and the dimension of Sk"Vk' is k xn.To obtained by performing SVD are particularly useful compute the prediction we simply calculate the dot for our application because of the property that SVD product of the h row of URSk and the ph column of provides the best lower rank approximations of the SkVk and add the customer average back using the iginal matrix R. in terms of frobenius norm. It is following possible to reduce the rx r matrix S to have only largest diagonal values to obtain a matrix Sk, k < r. If )·√S4V(P) the matrices U and V are reduced accordingly, then the reconstructed matrix Rk= Uk. Sk vk is the closest Note that even though the Rnom matrix is dense, the rank-k matrix to R. In other words, Rk minimizes the pecial structure of the matrix NPR allows us to use Frobenius norm IR-RAll over all rank-k matrices sparse SVD algorithms (e. g, Lanczos) whose most linear to the number of non- We use SVD in recommender systems to perform zeros in the original matrix R two different tasks: First, we use SVD to capture latent relationships between customers and products that allow us to compute the predicted likeliness of a 3.1.2 Recommendation generation certain product by a customer. Second, we use SVD In our second experiment, we look into the prospects to produce a low-dimensional representation of the of using low-dimensional space as a basis for original customer-product space and then compute neighborhood formation and using the neighbors neighborhood in the reduced space. We then used opinions on products they purchased we recommend that to generate a list of top-N product a list of N products for a given customer. For commendations for customers. The following is a purpose we consider customer preference data as description of our experiments binary by treating each non-zero entry of the customer-product matrix as 1. This means that we are 3.1.1 Prediction Generation only interested in whether a customer consumed a We start with a customer-product ratings matrix that particular product but not how much he/she liked that is very sparse, we call this matrix R. To capture meaningful latent relationship we first removed sparsity by filling our customer-product ratings Neighborhood formation in the reduced space matrix. We tried two different approaches: using the The fact that the reduced dimensional representation average ratings for a customer and using the average of the original space is less sparse than its high ratings for a product. We found the product average dimensional counterpart led us to form the produce a better result. We also considered two neighborhood in that space. As before, we started normalization techniques: conversion of ratings to z with the original customer-product matrix A, and then scores and subtraction of customer average from each used Svd to produce three decomposed matrices U, ng. We found the latter approach to provide S, and v. We then reduced S by retaining only better results. After normalization we obtain a filled eigenvalues and obtained Sk. Accordingly, we normalized matrix Rnorm. Essentially, Rnorm =R+NPR, performed dimensionality reduction to obtain Uk and where NPR is the fill-in matrix that provides naive Vk like the previous method, we finally computed non-personalized recommendation. We factor the the matrix product UASk". This m x k matrix is the k matrix Rnorm and obtain a low-rank approximation dimensional representation of m customers. We then after applying the following steps described in performed vector similarity (cosine similarity)to (Deerwester et al. 1990) form the neighborhood in that reduced space sing SVd to obtain U, Sand I Top-N Recommendation generation reduce the matrix S to dimension k Once the neighborhood is formed we concentrate on compute the square-root of the reduced the neighbors of a given customer and analyze the matrix Sk, to obtain Sk oroducts they purchased to recommend N products compute two resultant matrices: UKSkand the target customer is most likely to purchase. After SkVK' computing the neighborhood for a given customer C we scan through the purchase record of each of the k These resultant matrices can now be used to compute neighbors and perform a frequency count on the the recommendation score for any customer c and products they purchased. The product list is then product p. Recall that the dimension of UKSk is m sorted and most frequently purchased N items are eturned as recommendations for the target customer
decreasing order of their magnitude. The matrices obtained by performing SVD are particularly useful for our application because of the property that SVD provides the best lower rank approximations of the original matrix R, in terms of Frobenius norm. It is possible to reduce the r ´ r matrix S to have only k largest diagonal values to obtain a matrix Sk , k < r. If the matrices U and V are reduced accordingly, then the reconstructed matrix Rk = Uk .Sk .Vk ¢ is the closest rank-k matrix to R. In other words, Rk minimizes the Frobenius norm ||R- Rk || over all rank-k matrices. We use SVD in recommender systems to perform two different tasks: First, we use SVD to capture latent relationships between customers and products that allow us to compute the predicted likeliness of a certain product by a customer. Second, we use SVD to produce a low-dimensional representation of the original customer-product space and then compute neighborhood in the reduced space. We then used that to generate a list of top-N product recommendations for customers. The following is a description of our experiments. 3.1.1 Prediction Generation We start with a customer-product ratings matrix that is very sparse, we call this matrix R. To capture meaningful latent relationship we first removed sparsity by filling our customer-product ratings matrix. We tried two different approaches: using the average ratings for a customer and using the average ratings for a product. We found the product average produce a better result. We also considered two normalization techniques: conversion of ratings to zscores and subtraction of customer average from each rating. We found the latter approach to provide better results. After normalization we obtain a filled, normalized matrix Rnorm. Essentially, Rnorm = R+NPR, where NPR is the fill-in matrix that provides naive non-personalized recommendation. We factor the matrix Rnorm and obtain a low-rank approximation after applying the following steps described in (Deerwester et al. 1990): · factor Rnorm using SVD to obtain U, S and V. · reduce the matrix S to dimension k · compute the square-root of the reduced matrix Sk , to obtain Sk 1/2 · compute two resultant matrices: UkSk 1/2 and Sk 1/2Vk ¢ These resultant matrices can now be used to compute the recommendation score for any customer c and product p. Recall that the dimension of UkSk 1/2 is m ´ k and the dimension of Sk 1/2Vk ¢ is k ´ n. To compute the prediction we simply calculate the dot product of the c th row of UkSk 1/2 and the p th column of Sk 1/2Vk ¢ and add the customer average back using the following: C C U . S (c) S .V (P) P K k k k pred ¢ × ¢ = + . Note that even though the Rnorm matrix is dense, the special structure of the matrix NPR allows us to use sparse SVD algorithms (e.g., Lanczos) whose complexity is almost linear to the number of nonzeros in the original matrix R. 3.1.2 Recommendation generation In our second experiment, we look into the prospects of using low-dimensional space as a basis for neighborhood formation and using the neighbors’ opinions on products they purchased we recommend a list of N products for a given customer. For this purpose we consider customer preference data as binary by treating each non-zero entry of the customer-product matrix as 1. This means that we are only interested in whether a customer consumed a particular product but not how much he/she liked that product. Neighborhood formation in the reduced space: The fact that the reduced dimensional representation of the original space is less sparse than its highdimensional counterpart led us to form the neighborhood in that space. As before, we started with the original customer-product matrix A, and then used SVD to produce three decomposed matrices U, S, and V. We then reduced S by retaining only k eigenvalues and obtained Sk . Accordingly, we performed dimensionality reduction to obtain Uk and Vk . Like the previous method, we finally computed the matrix product UkSk 1/2. This m ´ k matrix is the k dimensional representation of m customers. We then performed vector similarity (cosine similarity) to form the neighborhood in that reduced space. Top-N Recommendation generation: Once the neighborhood is formed we concentrate on the neighbors of a given customer and analyze the products they purchased to recommend N products the target customer is most likely to purchase. After computing the neighborhood for a given customer C, we scan through the purchase record of each of the k neighbors and perform a frequency count on the products they purchased. The product list is then sorted and most frequently purchased N items are returned as recommendations for the target customer
We call this scheme most frequent item 4 Experiments 3.1.3 Sensitivity of Number of Dimensions k 4.1 Experimental Platform The optimal choice of the value k is critical to high- 4.1.1 Data sets quality prediction generation. We are interested in a value of k that is large enough to capture all the As mentioned before we report two different Important structures in the matrix yet small enough to experiments. In the first experiment we used data avoid overfitting errors. We experimentally find from our MovieLens recommender system to good value of k by trying several different values evaluate the effectiveness of our LSi-based prediction Movielens 3.1.4 Performance Implications recommender system that debuted in Fall 1997. Each Inpracticee-commercesiteslikeamazon.com week hundreds of users visit movielens to rate and experiences tremendous amount of customer visits receive recommendations for movies. The site now per day. Recommending products to these large has over 35000 users who have expressed opinions number of customers in real-time requires the on 3000+ different movies. We randomly selected underlying recommendation engine to be highly enough users to obtain 100,000 rating- records from scalable. Recommendation algorithms usually divide the database(we only considered users that had rated the prediction generation algorithm into two part twenty or more movies). Rating-record in this context offline component and online component. Offline is defined to be a triplet component is the portion of the algorithm that We divided the rating-records into training set and a requires an enormous amount of computation e.g test set according to different ratios We call this the computation of customer-customer correlation in training ratio and denote it by x. a value of x0.3 of correlation-based algorithm. Online indicates that we divide the 100,000 ratings data set omponent is the portion of the algorithm that is into 30.000 train cases and 70.000 test cases. The dynamically computed to provide predictions to training data was converted into a user-movie matrix customers using data from stored offline component R that had 943 rows (i.e, 943 users) and 1682 In case of SVD-based recommendation generation columns (i.e, 1682 movies that were rated by at least the decomposition of the customer-product matrix one of the users). Each entry ri, represented the and computing the reduced user and item matrices rating(from I to 5 )of the i user on themovie i.e., UKSk and Sk vk' can be done offline The second experiment is designed to test the Offline computation is not very critical to the effectiveness of"neighborhood formed in low performance of the recommender system. But there dimensional space". In addition to the above movie are some issues with the memory and secondary data, we use historical catalog purchase data from a storage requirement that need to be addressed. In case large e-commerce company. This data set contains of SvD, the offline component requires more time purchase information of 6,502 users on 23, 554 compared to the correlation-based algorithm. For an catalog items. In total, this data set contains 97, 045 mx n matrix the Svd decomposition requires a time purchase records. In case of the commerce data set, in the order of o((m+n)Deerwester et al., 1990) each record is a triplet . Since, purchase amount cant be storage, however, SVD is more efficient, we need to meaningfully converted to user rating, we didnt use store just two reduced customer and product matrices the second data set for prediction experiment. We of size m x k and k xn respectively, a total of converted all purchase amounts to"I"to make the O(m+n), since k is constant. But in case of the data set binary and then used it for recommendation correlation-based algorithm. a m x m all-to-all experiment. As before, we divided the data set into a correlation table must be stored requiring O(3) train set and a test set by using similar notion of ge, which can be substantially large with training ratio, x millions of customers and products So, we observe that as a result of dimensionality I In addition to Movie users, the system reduction SVd based online performance is much ratings from more than 45,000 better than correlation based algorithms For the same users. The eachMovie data is based on a stat ason, neighborhood formation is also much faster ble research by Digital when done in low dimensional space Corporations Systems Research Center
We call this scheme most frequent item recommendation. 3.1.3 Sensitivity of Number of Dimensions k The optimal choice of the value k is critical to highquality prediction generation. We are interested in a value of k that is large enough to capture all the important structures in the matrix yet small enough to avoid overfitting errors. We experimentally find a good value of k by trying several different values. 3.1.4 Performance Implications In practice, e-commerce sites like amazon.com experiences tremendous amount of customer visits per day. Recommending products to these large number of customers in real-time requires the underlying recommendation engine to be highly scalable. Recommendation algorithms usually divide the prediction generation algorithm into two parts: offline component and online component. Offline component is the portion of the algorithm that requires an enormous amount of computation e.g., the computation of customer-customer correlation in case of correlation-based algorithm. Online component is the portion of the algorithm that is dynamically computed to provide predictions to customers using data from stored offline component. In case of SVD-based recommendation generation, the decomposition of the customer-product matrix and computing the reduced user and item matrices i.e., UkSk 1/2 and Sk 1/2Vk ¢ can be done offline. Offline computation is not very critical to the performance of the recommender system. But there are some issues with the memory and secondary storage requirement that need to be addressed. In case of SVD, the offline component requires more time compared to the correlation-based algorithm. For an m ´ n matrix the SVD decomposition requires a time in the order of O((m+n)3 ) (Deerwester et. al., 1990). Computation of correlation takes O(m2 .n). In terms of storage, however, SVD is more efficient, we need to store just two reduced customer and product matrices of size m ´ k and k ´ n respectively, a total of O(m+n), since k is constant. But in case of the correlation-based algorithm, a m ´ m all-to-all correlation table must be stored requiring O(m2) storage, which can be substantially large with millions of customers and products. So, we observe that as a result of dimensionality reduction SVD based online performance is much better than correlation based algorithms. For the same reason, neighborhood formation is also much faster when done in low dimensional space. 4 Experiments 4.1 Experimental Platform 4.1.1 Data sets As mentioned before we report two different experiments. In the first experiment we used data from our MovieLens recommender system to evaluate the effectiveness of our LSI-based prediction algorithm. MovieLens (www.movielens.umn.edu) is a web-based research recommender system that debuted in Fall 1997. Each week hundreds of users visit MovieLens to rate and receive recommendations for movies. The site now has over 35000 users who have expressed opinions on 3000+ different movies.1 We randomly selected enough users to obtain 100,000 rating-records from the database (we only considered users that had rated twenty or more movies). Rating-record in this context is defined to be a triplet . We divided the rating-records into training set and a test set according to different ratios. We call this training ratio and denote it by x. A value of x=0.3 indicates that we divide the 100,000 ratings data set into 30,000 train cases and 70,000 test cases. The training data was converted into a user-movie matrix R that had 943 rows (i.e., 943 users) and 1682 columns (i.e., 1682 movies that were rated by at least one of the users). Each entry ri,j represented the rating (from 1 to 5) of the i th user on the j th movie. The second experiment is designed to test the effectiveness of “neighborhood formed in low dimensional space”. In addition to the above movie data, we use historical catalog purchase data from a large e-commerce company. This data set contains purchase information of 6,502 users on 23,554 catalog items. In total, this data set contains 97,045 purchase records. In case of the commerce data set, each record is a triplet . Since, purchase amount can’t be meaningfully converted to user rating, we didn’t use the second data set for prediction experiment. We converted all purchase amounts to “1” to make the data set binary and then used it for recommendation experiment. As before, we divided the data set into a train set and a test set by using similar notion of training ratio, x. 1 In addition to MovieLens' users, the system includes over two million ratings from more than 45,000 EachMovie users. The EachMovie data is based on a static collection made available for research by Digital Equipment Corporation's Systems Research Center
4.1.2 Benchmark recommender systems Root Mean Squared Error (RMSE) and To compare the performance of SVD-based Correlation between ratings and predictions are widely used metrics Our experience has shown prediction we also entered the training ratings set into a collaborative filtering recommendation engine that that these metrics typically track each other employs the Pearson nearest neighbor algorithm. For closely this purpose we implemented CF-Predict, a flexible Decision support accuracy metrics evaluate how recommendation engine that implements effective a prediction engine is at helping a user products from the set of CF-Predict to use the best published Pearson nearest products. These metrics assume the prediction neighbor algorithm and configured it to deliver the process as a binary operation-either products highest quality prediction without concern for are predicted (good) or not(bad). With this performance (i.e, it considered every possible observation, whether a product has a prediction neighbor to form optimal neighborhoods). To score of 1.5 or 2.5 on a five-point scale is compare the quality of SVD neighborhood-based irrelevant if the customer only chooses to commendations, we implemented another consider predictions of 4 or higher. The most recommender system that uses cosine-similarity in commonly used decision support accuracy high dimensional space to form neighborhood and metrics are reversal rate, weighted errors and eturns top-N recommendations, we call it CF- ROC sensitivity (Le et al., 1995) Recommend. We used cosine measure for building We used MAE as our choice of evaluation metric to neighborhood in both cases because in the low dimensional space proximity is measured only by report prediction experiments because it is most commonly used and easiest to interpret directly. In computing the cosine our previous experiments( Sarwar et al., 1999)we For each of the ratings in the test data set. we have seen that MAE and RoC provide the sar requested a recommendation score from CF-Predict ordering of different experimental schemes in terms and also computed a recommendation score from the of prediction quality matrices UKSkand SkVk and compared them 4.2.2 Top-n recommendation evaluation 4.2 Evaluation Metrics metrics Recommender systems research has used several To evaluate top-N recommendation we use two types of measures for evaluating the success of a metrics widely used in the information retrieval (IR) recommender system. We only consider the quality community namely recall and precision. However, of prediction or recommendation, as we're or we slightly modify the definition of recall and interested in the output of a recommender system for orecision as our experiment is different from standard the evaluation purpose. It is, however, possible to IR. We divide the products into two sets: the test set evaluate intermediate steps (e.g, the quality of and top-N set. Products that appear in both sets are neighborhood formation). Here we discuss two types members of the hit set. We now define recall and of metrics for evaluating predictions and top-M precision as the following commendations respectively Recall in the context of the recommender 4.2.1 Prediction evaluation metrics system is defined as ce of hit set lest∩ Recall= se the following metrics ce of test set test Coverage metrics evaluate the number of Precision is defined as products for which the system could provide recommendations Overall coverage Is test topi computed as the percentage of customer-product Precision ofhi size of topN N pairs for which a recommendation can be made These two measures are, however, often conflicting Statistical accuracy metrics evaluate the in nature. For instance. increasing the number N of tends to increase recall but decreases precision. The recommendation scores against the actual fact that both are critical for the quality judgement customer ratings for the customer-product pairs bination of the two. In in the test dataset. Mean Absolute Error(MAE particular, we use the standard FI metric(Yang et
4.1.2 Benchmark recommender systems To compare the performance of SVD-based prediction we also entered the training ratings set into a collaborative filtering recommendation engine that employs the Pearson nearest neighbor algorithm. For this purpose we implemented CF-Predict, a flexible recommendation engine that implements collaborative filtering algorithms using C. We tuned CF-Predict to use the best published Pearson nearest neighbor algorithm and configured it to deliver the highest quality prediction without concern for performance (i.e., it considered every possible neighbor to form optimal neighborhoods). To compare the quality of SVD neighborhood-based recommendations, we implemented another recommender system that uses cosine-similarity in high dimensional space to form neighborhood and returns top-N recommendations, we call it CFRecommend. We used cosine measure for building neighborhood in both cases because in the low dimensional space proximity is measured only by computing the cosine. For each of the ratings in the test data set, we requested a recommendation score from CF-Predict and also computed a recommendation score from the matrices UkSk 1/2 and Sk 1/2Vk ¢ and compared them. 4.2 Evaluation Metrics Recommender systems research has used several types of measures for evaluating the success of a recommender system. We only consider the quality of prediction or recommendation, as we're only interested in the output of a recommender system for the evaluation purpose. It is, however, possible to evaluate intermediate steps (e.g., the quality of neighborhood formation). Here we discuss two types of metrics for evaluating predictions and top-N recommendations respectively. 4.2.1 Prediction evaluation metrics To evaluate an individual item prediction researchers use the following metrics: ß Coverage metrics evaluate the number of products for which the system could provide recommendations. Overall coverage is computed as the percentage of customer-product pairs for which a recommendation can be made. ß Statistical accuracy metrics evaluate the accuracy of a system by comparing the numerical recommendation scores against the actual customer ratings for the customer-product pairs in the test dataset. Mean Absolute Error (MAE), Root Mean Squared Error (RMSE) and Correlation between ratings and predictions are widely used metrics. Our experience has shown that these metrics typically track each other closely. ß Decision support accuracy metrics evaluate how effective a prediction engine is at helping a user select high-quality products from the set of all products. These metrics assume the prediction process as a binary operation—either products are predicted (good) or not (bad). With this observation, whether a product has a prediction score of 1.5 or 2.5 on a five-point scale is irrelevant if the customer only chooses to consider predictions of 4 or higher. The most commonly used decision support accuracy metrics are reversal rate, weighted errors and ROC sensitivity (Le et al., 1995) We used MAE as our choice of evaluation metric to report prediction experiments because it is most commonly used and easiest to interpret directly. In our previous experiments (Sarwar et al., 1999) we have seen that MAE and ROC provide the same ordering of different experimental schemes in terms of prediction quality. 4.2.2 Top-N recommendation evaluation metrics To evaluate top-N recommendation we use two metrics widely used in the information retrieval (IR) community namely recall and precision. However, we slightly modify the definition of recall and precision as our experiment is different from standard IR. We divide the products into two sets: the test set and top-N set. Products that appear in both sets are members of the hit set. We now define recall and precision as the following: ß Recall in the context of the recommender system is defined as: ß Precision is defined as: These two measures are, however, often conflicting in nature. For instance, increasing the number N tends to increase recall but decreases precision. The fact that both are critical for the quality judgement leads us to use a combination of the two. In particular, we use the standard F1 metric (Yang et. test test top N size of test set size of hit set Recall I = = N test topN size of topN set size of hit set Precision I = =
al. 1999)that gives equal weight to them both and is average back into each recommendation scores and computed as follows loaded the training set ratings into CF-Predict and request recommendation scores on each of the test set 人、2*Recl1* Precsion ratings. Computed MaE of the SVD and the CF- (Recall+Precision Predict recommendation scores and compare the two ets of results We compute FI for each individual customer and calculate the average value to use as our metric. We repeated the entire process for k=2, 5-21, 25, 50 and 100, and found 14 to be the most optimum value 4.3 Experimental Steps Figure 3(a)). We then fixed k at 14 and varied the train/test ratio x from 0.2 to 0.95 with an increment of 0.05 and for each point we experiment 10 4.3.1 Prediction Experiment imes each time choosing different training/test sets and take the average to generate the plots. Note that Each entry in our data matrix R represents a rating on the overall performance of the SVD-based prediction a 1-5 scale, except that in cases where the user i algorithm does significantly change for a wide range didnt rate movie j the entry Iii is null. We then of values of k performed the following experimental step We computed the average ratings for each user and 4.3.2 Top-N recommendation experiment for each movie and filled the null entries in the We started with a matrix as the previous experiment matrix by replacing each null entry with the column but converted the rating entries(i.e, non-zero entries) average for the corresponding column. Then we normalized all entries in the matrix by replacing each to 1. Then we produced top-10 product recommendations for each customer based on the entry rij with(rij-[). Then MATLAB was used to compute the Svd of the filled and normalized matrix following two schemes R, producing the three SVD component matrices U,S High dimensional neighborhood: In this scheme and v. S is the matrix that contains the singular we built the customer neighborhood values of matrix R sorted in decreasing order. Sk was original customer-product space and used most computed from S by retaining only k largest singular frequent item recommendation to produce top-10 values and replacing the rest of the singular with 0 roduct list. We then We computed the square root of the reduced matrix evaluate the quality and computed the matrices UKSk and SkV'k as Low dimensional neighborhood: We first reduce mentioned above, multiplied the matrices UASk"and the dimensionality of the original space by SkVk producing a 943 x 1682 matrix. Since the applying SVD and then used UkSk 2(i.e inner product of a row from UKSkand a column representation of customers in k dimensional from SkVk gives us a recommendation score, this space) matrix to build the neighborhood. As resultant matrix p holds the recommendation score SvD prediction qua ty variation with number of dimension (k is fixed at 14 for SvD) 夕步小的步步。的步 number of dimension, k Figure 3. (a) Determination of optimum value ofk.(b)SID vS. CF-Predict prediction quality for each user-movie pair i, j in Pir. We then de- frequent item normalize the matrix entries by adding the user
al., 1999) that gives equal weight to them both and is computed as follows: We compute F1 for each individual customer and calculate the average value to use as our metric. 4.3 Experimental Steps 4.3.1 Prediction Experiment. Each entry in our data matrix R represents a rating on a 1-5 scale, except that in cases where the user i didn’t rate movie j the entry ri,j is null. We then performed the following experimental steps. We computed the average ratings for each user and for each movie and filled the null entries in the matrix by replacing each null entry with the column average for the corresponding column. Then we normalized all entries in the matrix by replacing each entry ri,j with (ri,j - ri ). Then MATLAB was used to compute the SVD of the filled and normalized matrix R, producing the three SVD component matrices U, S and V'. S is the matrix that contains the singular values of matrix R sorted in decreasing order. Sk was computed from S by retaining only k largest singular values and replacing the rest of the singular with 0. We computed the square root of the reduced matrix and computed the matrices UkSk 1/2 and Sk 1/2V'k as mentioned above, multiplied the matrices UkSk 1/2 and Sk 1/2V'k producing a 943 x 1682 matrix. Since the inner product of a row from UkSk 1/2 and a column from Sk 1/2Vk gives us a recommendation score, this resultant matrix P holds the recommendation score for each user-movie pair i,j in Pij. We then denormalize the matrix entries by adding the user average back into each recommendation scores and loaded the training set ratings into CF-Predict and request recommendation scores on each of the test set ratings. Computed MAE of the SVD and the CFPredict recommendation scores and compare the two sets of results. We repeated the entire process for k = 2, 5-21, 25, 50 and 100, and found 14 to be the most optimum value (Figure 3(a)). We then fixed k at 14 and varied the train/test ratio x from 0.2 to 0.95 with an increment of 0.05 and for each point we run the experiment 10 times each time choosing different training/test sets and take the average to generate the plots. Note that the overall performance of the SVD-based prediction algorithm does significantly change for a wide range of values of k. 4.3.2 Top-N recommendation experiment: We started with a matrix as the previous experiment but converted the rating entries (i.e., non-zero entries) to 1. Then we produced top-10 product recommendations for each customer based on the following two schemes: ß High dimensional neighborhood: In this scheme we built the customer neighborhood in the original customer-product space and used most frequent item recommendation to produce top-10 product list. We then used our F1 metric to evaluate the quality. ß Low dimensional neighborhood: We first reduce the dimensionality of the original space by applying SVD and then used UkSk 1/2 (i.e., representation of customers in k dimensional space) matrix to build the neighborhood. As before we used most frequent item (Recall Precision) 2 Recall Precsion F1 + * * = SVD as Prediction Generator (k is fixed at 14 for SVD) 0.72 0.74 0.76 0.78 0.8 0.82 0.84 0.86 0.2 0.25 0.3 0.35 0.4 0.45 0.5 0.55 0.6 0.65 0.7 0.75 0.8 0.85 0.9 0.95 x (train/test ratio) MAE Pure-CF SVD SVD prediction quality variation with number of dimension 0.72 0.73 0.74 0.75 0.76 0.77 0.78 0.79 0.8 2 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 2 5 50 100 number of dimension, k Mean absolute error x=0.2 x=0.5 x=0.8 Figure 3. (a) Determination of optimum value of k. (b) SVD vs. CF-Predict prediction quality
recommendation to produce top-10 list and 4.4.2 Top-n recommendation experiment evaluated by using FI metric In this experiment our main focus was on the e- For the recommendation experiment, we first commerce data. We also report our findings when we determined the optimum x ratio for both of our data apply this technique on our movie preference data sets in high dimensional and low dimensional cases At first we run the high dimensional experiment for 4.4 Results different x ratio and then we perform low dimensional experiments for different x values for a 4.4.1 Prediction experiment results fixed dimension (k)and compute the FI metric Figure 4 shows our results, we observe that optimum Figure 3(b) charts our results for the x values are 0. 8 and 0.6 for the movie data and the e- experiment. The data sets were obtained commerce data respectively. ame sample of 100,000 ratings, by varying of the training and test data sets, (recall tha Once we obtain the best x value, we run high Determination of the optimum x value -o-ML High-dim Determination of the optimum x value (Movie data set -a-ML Low-dim (Commerce data set) 8642 02030.40.5060.7080.9 train ratio. x Figure 4. Determination of the optimum value of x a) for the Movie data b) for the Commerce data ratio between the size of the training set and the size dimensional experiment for that x and compute F of the entire data set). Note that the different values metric. Then we run our low-dimensional of x were used to determine the sensitivity of the experiments for that x ratio, but vary the number of different schemes on the sparsity of the training set dimension, k. Our results are presented in figures 5 and 6. We represent the corresponding high dimensional results in the chart by drawing vertical Top-10 recommendatio L Low-dim (Movie data set) 0.226 High dimensional value at x=0.8 102030405060708090100 Dimension, k Figure 5. Top-10 recommendation results for the Movielens data set
recommendation to produce top-10 list and evaluated by using F1 metric. In this experiment our main focus was on the Ecommerce data. We also report our findings when we apply this technique on our movie preference data. 4.4 Results 4.4.1Prediction experiment results Figure 3(b) charts our results for the prediction experiment. The data sets were obtained from the same sample of 100,000 ratings, by varying the sizes of the training and test data sets, (recall that x is the ratio between the size of the training set and the size of the entire data set). Note that the different values of x were used to determine the sensitivity of the different schemes on the sparsity of the training set. 4.4.2 Top-N recommendation experiment results For the recommendation experiment, we first determined the optimum x ratio for both of our data sets in high dimensional and low dimensional cases. At first we run the high dimensional experiment for different x ratio and then we perform low dimensional experiments for different x values for a fixed dimension (k) and compute the F1 metric. Figure 4 shows our results, we observe that optimum x values are 0.8 and 0.6 for the movie data and the Ecommerce data respectively. Once we obtain the best x value, we run high dimensional experiment for that x and compute F1 metric. Then we run our low-dimensional experiments for that x ratio, but vary the number of dimension, k. Our results are presented in figures 5 and 6. We represent the corresponding high dimensional results in the chart by drawing vertical Top-10 recommendation (Movie data set) 0.22 0.222 0.224 0.226 0.228 0.23 0.232 10 20 30 40 50 60 70 80 90 100 Dimension, k F1 Metric ML Low-dim ML High-dim High dimensional value at x = 0.8 Figure 5. Top-10 recommendation results for the MovieLens data set. Determination of the optimum x value (Movie data set) 0.1 0.12 0.14 0.16 0.18 0.2 0.22 0.24 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 train ratio, x F1 Metric ML High-dim ML Low-dim Determination of the optimum x value (Commerce data set) 0 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 train ratio, x F1 Metric EC High-dim EC Low-dim Figure 4. Determination of the optimum value of x. a) for the Movie data b) for the Commerce data
lines at their corresponding values approximation of the original space. Also another factor to consider is the amount of sparsity in the data 4.5 Discussion sets, the movie data is 95.4% sparse (100,000 nonzero entries in 943x1, 682 matrix), while the e- In case of the prediction experiment, we observe that commerce data is 99.996% sparse(97, 045 nonzero in Figure 3(b) for x0.5 hypothesis we deliberately increased sparsity of our however, the CF-Predict predictions are slightly movie data (i.e, remove nonzero entries) and better. This suggests that nearest-neighbor based epeated the experiment and observed dramatic collaborative filtering algorithms are susceptible to eduction in fl values data sparsity as the neighborhood formation process is hindered by the lack of enough training data. On Overall, the results are encouraging for the use of the other hand, SVd based prediction algorithms can SVD in collaborative filtering recommender systems overcome the sparsity problem by utilizing the latent The Svd algorithms fit well with the collaborative relationships. However, as the training data is filtering data, and they result in good quality increased both SVD and CF-Predict prediction predictions. And SVd has potential to provide better uality improve but the improvement in case of cF. online performance than correlation-based systems Predict surpasses the SVD improvement In case of the top-10 recommendation experiment we have seen even with a small fraction of dimension FI the plots of the recommender results(fi L.e., 20 out of 1682 in movie data, SVD-based and 6), we observe that for the movie data the best recommendation quality was better than result happens in the vicinity of k=20 and in case of corresponding high dimensional scheme. It indicates Top-10 recommendation — EC LOW-din Commerce data set) 一 EC High-dm 0.17 0.16 High dimensional value at x=0.6 0.14 0.13 0.12 0.11 50100150200250300350400450500600700 Dimension k Figure 6. Top-10 recommendation results for the E-Commerce data set the e-commerce data the recommendation quality that neighborhoods formed in the reduced keeps on growing with increasing dimensions. The dimensional space are better than their high movie experiment reveals that the low dimensional dimensional counterparts esults are better than the high dimensional counterpart at all values of k. In case of the e- commerce experiment the high dimensional result is always better, but as more and more dimensions are added low dimensional values improve. However, we increased the dimension values up to 700, but the low dimensional values were still lower than the high dimensional value. Since the commerce data is very 2 We're also working with experiments to use the reduced high dimensional(6502x23554), probably such a dimensional neighborhood for prediction generation using mall k value is not sufficient to provide a usef classical CF algorithm. So far, the results are encouraging
lines at their corresponding values. 4.5 Discussion In case of the prediction experiment, we observe that in Figure 3(b) for x0.5, however, the CF-Predict predictions are slightly better. This suggests that nearest-neighbor based collaborative filtering algorithms are susceptible to data sparsity as the neighborhood formation process is hindered by the lack of enough training data. On the other hand, SVD based prediction algorithms can overcome the sparsity problem by utilizing the latent relationships. However, as the training data is increased both SVD and CF-Predict prediction quality improve but the improvement in case of CFPredict surpasses the SVD improvement. From the plots of the recommender results (Figures 5 and 6), we observe that for the movie data the best result happens in the vicinity of k=20 and in case of the e-commerce data the recommendation quality keeps on growing with increasing dimensions. The movie experiment reveals that the low dimensional results are better than the high dimensional counterpart at all values of k. In case of the ecommerce experiment the high dimensional result is always better, but as more and more dimensions are added low dimensional values improve. However, we increased the dimension values up to 700, but the low dimensional values were still lower than the high dimensional value. Since the commerce data is very high dimensional (6502x23554), probably such a small k value is not sufficient to provide a useful approximation of the original space. Also another factor to consider is the amount of sparsity in the data sets, the movie data is 95.4% sparse (100,000 nonzero entries in 943x1,682 matrix), while the ecommerce data is 99.996% sparse (97,045 nonzero entries in 6,502x23,554 matrix). To test this hypothesis we deliberately increased sparsity of our movie data (i.e., remove nonzero entries) and repeated the experiment and observed dramatic reduction in F1 values! Overall, the results are encouraging for the use of SVD in collaborative filtering recommender systems. The SVD algorithms fit well with the collaborative filtering data, and they result in good quality predictions. And SVD has potential to provide better online performance than correlation-based systems. In case of the top-10 recommendation experiment we have seen even with a small fraction of dimension, i.e., 20 out of 1682 in movie data, SVD-based recommendation quality was better than corresponding high dimensional scheme. It indicates that neighborhoods formed in the reduced dimensional space are better than their high dimensional counterparts.2 2 We’re also working with experiments to use the reduced dimensional neighborhood for prediction generation using classical CF algorithm. So far, the results are encouraging. Top-10 recommendation (Commerce data set) 0.09 0.1 0.11 0.12 0.13 0.14 0.15 0.16 0.17 50 100 150 200 250 300 350 400 450 500 600 700 Dimension, k F1 Metric EC Low-dim EC High-dim High dimensional value at x = 0.6 Figure 6. Top-10 recommendation results for the E-Commerce data set