Making use of associative classifiers in order to alleviate typical drawbacks in recommender systems

Joel Pinho Lucas ⇑, Saddys Segrera, María N. Moreno

Departamento de Informática y Automática, Universidad de Salamanca, Plaza de la Merced s/n, 37008 Salamanca, Spain

⇑ Corresponding author. Tel.: +34 923294653; fax: +34 923294514. E-mail addresses: joelpl@usal.es (J. Pinho Lucas), saddys@usal.es (S. Segrera), mmg@usal.es (M.N. Moreno).

Keywords: Recommender systems, Associative classification, Sparsity

Abstract: Nowadays, there is a constant need for personalization in e-commerce systems. Recommender systems make suggestions and provide information about the items available; however, many recommender techniques are still vulnerable to some shortcomings. In this work, we analyze how the methods employed in these systems are affected by some typical drawbacks. Hence, we conduct a case study using data gathered from real recommender systems in order to investigate which machine learning methods can alleviate such drawbacks. Due to some special features inherent to associative classifiers, we give particular attention to this category of methods to test their capability of dealing with typical drawbacks. © 2011 Elsevier Ltd. All rights reserved.

1. Introduction

Nowadays, e-commerce systems present loads of products available for sale. Thus, users would probably have difficulty choosing the products they prefer and, consequently, purchasing them. Due to such facts and to an increasingly competitive industry, these systems need to personalize the presentation of their products. A way to reach such personalization is by means of ‘‘recommender systems’’, which, according to Bae and Kim (2010), have emerged in e-commerce applications to provide advice to customers about items they might wish to purchase or examine. In this sense, recommender systems aim at enabling the creation of a new store personally designed for each consumer (Schafer, Konstan, & Riedl, 2001).

Taking into account that machine learning techniques are applied for identifying patterns with different purposes, such as classification or prediction and knowledge discovery, according to Cheung, Kwok, Law, and Tsui (2003) these techniques can be successfully applied to predict users' preferences in recommender systems. Machine learning methods can provide several improvements on these systems and thus provide more personalization. However, even employing machine learning methods, recommender systems still suffer from innumerable limitations and may be very susceptible to producing errors.

In this work we investigate the main shortcomings presented by recommender systems and how they affect recommendations' quality. Afterwards, we analyze how they may be alleviated. Therefore, after obtaining successful results with association rules in other domains with similar drawbacks (Moreno, Ramos, García, & Toro, 2008), we try classification based on association, a machine learning technique that combines concepts from classification and association, in recommender systems. As will be described along the next sections, such a technique may present several advantages if applied for building a recommender model. Moreover, in a preliminary study made in Lucas, Segrera, and Moreno (2008), we showed that classification based on association has a promising potential in the recommender systems domain. In order to perform such analysis, we accomplished a case study using two recommender systems databases. The first consists of ratings of movies made by MovieLens users and the second consists of book ratings made by the BookCrossing community.
We considered it essential to use MovieLens data in the case study we propose, because almost every case study on recommender systems we found employed mainly movie rating datasets (mostly MovieLens). However, employing just one type of data from a single domain may limit the scope of many case studies. This shortcoming of this research field is also confirmed by Herlocker, Konstan, Terveen, and Riedl (2004), who argue that the lack of variety in publicly available collaborative filtering datasets (particularly with significant numbers of ratings) remains one of the most significant challenges in the field. Hence, we decided to employ, in addition to a classic movie rating dataset, a different database from a different domain in order to enhance the quality of the case study presented in this work.

Herlocker et al. (2004) also affirm that even early researchers recognized that, in a recommender scenario, it can be more valuable to measure how often the system leads its users to wrong choices, and that accuracy differences are usually tiny even when they are measurable. Based on such observations, we consider other
metrics, besides accuracy, to evaluate algorithms in a recommender context. Hence, we also consider the false positive occurrence and the number of rules generated (when applicable for the algorithm being analyzed). Moreover, we also took into account some implicit features related to shortcomings of recommender systems.

In the next section we present the main classes of methods employed in recommender systems. Subsequently, we depict the most important ones; at that point, we also describe the use of association rules for classification problems. In Section 4, we quote and explain the main shortcomings and research challenges related to recommender systems. Finally, Section 5 contains the case study accomplished in order to test the methods' capability of alleviating the effects of shortcomings produced in recommender systems.

2. Types of recommender methods

Methods employed in recommender systems have their foundations in different areas of knowledge. Cheung et al. (2003) and Lee, Kim, and Rhee (2001) classify recommender systems depending on the type of method used for making recommendations. The two main categories of recommender methods are collaborative filtering algorithms and content-based approaches. Content-based methods compare text documents to a specific user profile, where Web objects are recommended to a user based on those he has been interested in in the past (Lee et al., 2001). On the other hand, collaborative filtering methods were proposed aiming to provide more consistency to recommendations by means of information related to the social context of each user. To do so, the recommendation is given to the active user according to the opinions about items given by other users who have a similar profile (similar preferences about other items).

The collaborative filtering approach was originally based on the nearest neighbor algorithm (Sarwar, Karypis, Konstan, & Riedl, 2001), which recommends products to a target user according to the opinions of users who have similar purchase patterns. Thus, recommended products will be the ones liked by users with similar interests. This approach appeals to the notion that, when we are looking for information, we often seek the advice of friends with similar tastes or other people whose judgment we trust (Condliff, Lewis, Madigan, & Posse, 1999). In this way, information about items that other people have already found and evaluated is taken into account.

Breese, Heckerman, and Kadie (1998) classified collaborative filtering methods into two groups: memory-based methods and model-based methods. In memory-based methods the nearest neighbors of a target user are found by matching the opinions of such a user to the opinions of all the system's users. On the other hand, model-based methods build a predictive model by means of a training set comprising opinions acquired from a small portion of the system's users. Such methods have been developed more recently in order to avoid the sparsity problem, which usually arises when memory-based methods are employed, because e-commerce systems generally offer millions of products for sale, so that it is not feasible to obtain opinions about all of them (Sarwar, Karypis, Konstan, & Riedl, 2000). Current recommender systems usually do not hold a substantial number of evaluations compared with the number of items available in the system. As a consequence, the system seldom encompasses a complete database with evaluations of all items available.
That is why model-based methods generally employ a training set with evaluations gathered from just a part of the users of the system. Even so, the number of evaluations available is generally proportionally small and, as a result, it is still necessary to develop more techniques to address such a shortcoming.

Due to the difficulties in obtaining considerable evaluation information from users about the system's items, current recommender systems usually do not employ only collaborative filtering methods. Likewise, content-based methods are usually not employed on their own, because they are not effective due to the lack of mechanisms to extract Web objects' features. Hence, each of these approaches has its strengths and weaknesses (Bae & Kim, 2010). Therefore, content-based and collaborative filtering methods are commonly combined or employed together in recommender systems. Combining different methods to overcome the disadvantages and limitations of a single method may improve the performance of recommenders (Göksedef & Gündüz-Ögüdücü, 2010).

3. Methods employed in recommender systems

Seeing that recommender systems recommend items to users based on ratings or past customer behavior, and also that there are usually several items and users, these need to be grouped in order to make recommendations feasible. In this way, classification methods are mostly employed to classify every user and/or item into one of those groups. Techniques used for classification are considered predictive methods because they aim at predicting the value of an attribute, called the label attribute, in a certain dataset. Each value of the label attribute must be discrete, since it is responsible for representing a class. The prediction of the label attribute value is achieved by means of the other attributes' values (the descriptive attributes). The values of these attributes must be known in all samples, so that they may build a training set defining clearly the characteristics of all classes. Thus, classification is considered a supervised learning approach. Furthermore, a test set is used to verify the consistency of the learning model.

In this section we describe the main techniques employed in recommender systems, which may be either supervised or unsupervised learning techniques. In the next subsection we depict the most common machine learning methods employed in recommender systems. Subsequently, we describe some foundations of association rules and how they can be employed in these systems. Finally, in Section 3.3, we describe classification based on association methods.

3.1. Machine learning methods

According to Bae and Kim (2010), most researchers have been using data mining techniques in recommender systems aiming at predicting the customer's future behavior and increasing the chance of repurchasing. In this way, and since data mining approaches are basically made up of machine learning methods, we may say that such methods are quite popular in the recommender systems area. Billsus and Pazzani (1998) were the first to apply machine learning techniques in recommender systems: the authors proposed to transform the traditional collaborative filtering recommender problem (nearest neighbor) into a machine learning problem. To do so, the authors first employed neural networks and considered the recommender problem as a classification problem. In this sense, a neural network constructs a model, defining classes of items, for recommendation using the ratings given by users and then classifying the non-rated items.
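To make this reformulation concrete, the following is a minimal sketch (our own illustration, not the implementation of the cited authors) of collaborative filtering recast as classification: each item rated by the active user becomes a training instance whose features are the other users' ratings of that item, and a standard classifier predicts whether the active user will like an unrated item. A decision tree from scikit-learn is used here for brevity instead of the neural network employed by Billsus and Pazzani; the toy rating matrix is invented.

```python
# Sketch: collaborative filtering recast as classification (illustrative only).
# Rows = users, columns = items; 0 means "not rated", 1-5 are rating scores.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

ratings = np.array([
    [5, 4, 0, 1, 3],   # active user (index 0); item 2 is unrated
    [4, 5, 2, 1, 2],
    [1, 2, 5, 4, 0],
    [5, 3, 1, 2, 4],
])

active = 0
rated   = [j for j in range(ratings.shape[1]) if ratings[active, j] > 0]
unrated = [j for j in range(ratings.shape[1]) if ratings[active, j] == 0]

# One training instance per item rated by the active user:
# features = the other users' ratings of that item,
# label    = 1 ("liked") if the active user scored it 3 or more.
X = np.array([ratings[1:, j] for j in rated])
y = np.array([1 if ratings[active, j] >= 3 else 0 for j in rated])

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

for j in unrated:
    pred = clf.predict(ratings[1:, j].reshape(1, -1))[0]
    print(f"item {j}: {'recommend' if pred == 1 else 'do not recommend'}")
```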
More recent neural network approaches (Chou, Li, Chen, & Wu, 2010) use consumer knowledge about items in order to provide personalization in e-commerce systems. They assume that customers have some amount of experience or information about the items they are using or plan to purchase. In this context, collaborative filtering may be seen as a prediction task, because the basic idea is to predict how a user will rate a new item (not rated before). That is the reason why this new approach for collaborative filtering was called item-based methods, because recommendations are given based
on items' features. However, nowadays item-based collaborative methods are generally known as model-based methods.

Since most machine learning techniques employed in recommender systems are supervised learning methods, they need to be run off-line in order to build a learning model used for classification. Thus, a great part of the processing is performed off-line and, hence, these methods do not require much on-line processing time, allowing recommendations to be provided faster. Moreover, they may reduce shortcomings caused by sparsity, because they do not need ratings of all available items in order to build the classification model. However, as highlighted in Section 2, due to the difficulties in obtaining considerable evaluation information from users about the system's items, model-based methods are commonly mixed with content-based methods in order to enhance recommendations' quality.

Despite being the first machine learning method applied for collaborative filtering, nowadays neural networks are not vastly employed in recommender systems. The ‘‘black box’’ learning approach of such a method, where no information is known except the input and output, is a crucial issue for building recommender models. In addition, the training time of a neural network is costly, which might be a critical issue in current recommender systems, whose data, according to Koren (2010), are changing over time, so models should be continuously updated to reflect their present nature.

As classifiers may be implemented employing many different machine learning techniques, recommender systems naturally expanded the model-based techniques they apply. Probably the most popular machine learning method for recommender systems is the Bayesian network, an effective data mining method widely applied for building recommender models (Condliff et al., 1999). The use of Bayesian networks for collaborative filtering was first suggested in Breese et al. (1998). It makes use of a training set to build a probabilistic structure for predicting user ratings. A distinct Bayesian network is used for representing every user to whom recommendations will be provided, and a learning algorithm is responsible for processing every network. Each node of the network represents, by means of a decision tree, a domain item, and the edges represent information about the user. The states of a node correspond to rating values for the item represented by the node. The classification model used for recommendations needs to be built and, depending on the amount of data in the system, this may be very costly. On the other hand, the output model is small and its efficiency is similar to that of the nearest neighbor algorithm. However, it may still not be efficient enough depending on the number of items, because in order to predict the rating of a given item, the algorithm needs to calculate the conditional probabilities of all possible ratings for such item given all possible ratings for other items.

Another popular data mining technique widely employed in recommender systems, as a model-based method, is clustering. It consists of performing unsupervised learning in order to identify groups of users who appear to have similar preferences. Therefore, the recommendations provided to the active user will be related to the opinions (or ratings) given by users of the group he belongs to.
Most clustering algorithms require a distance metric or similarity metric, obtained by computing the similarity between items, to guide the clustering process (Connor & Herlocker, 2001). Depending on the system, sometimes, after the clusters are created, the opinions of other users may be averaged within the cluster in order to provide recommendations for the active user. Nevertheless, clustering techniques may use fuzzy logic, in which case every user may belong to more than one cluster. In this way, each user will have partial participation in several clusters. At this point, a membership measure is assigned to each user in every cluster. The recommendation will be an average across the clusters, weighted by the membership value. However, even using fuzzy logic, according to Breese et al. (1998), clustering techniques usually produce less-personal recommendations than other methods, and in some cases the clusters have worse accuracy than classic collaborative filtering (i.e., nearest neighbor) algorithms. Therefore, nowadays clustering techniques are usually applied in collaborative filtering together with other methods. In this context, they are applied as a first step in order to distribute the data among different recommender methods or to reduce the initial dataset. While dividing the population into clusters may hurt the accuracy of recommendations to users near the fringes of their assigned cluster, pre-clustering may be a worthwhile trade-off between accuracy and throughput (Schafer, 2005).

Taking into account the machine learning techniques mainly applied in collaborative filtering methods, Support Vector Machines (SVMs) are quite frequent. Such a technique consists of a supervised learning method for building a linear classifier. In order to build an SVM, every user is seen as a vector composed of ratings about items. Such vectors are associated with a geometric space in which a hyperplane of separation between the possible classes is built. In this context, such classes are related to groups of users with similar preferences. Unlike other learning methods, an SVM's performance is related not to the number of features in the system, but to the margin with which it separates the data (Cheung et al., 2003).

Association rule-based methods for classification are also employed in recommender systems. They are becoming more and more popular due to some benefits they present for the collaborative filtering context. In the next subsection we detail association rule discovery applied to recommender systems.

3.2. Association rules

As with most data mining techniques, association rule induction algorithms can be employed to enhance personalization in recommender systems, such as the ones in Lee et al. (2001) and Lazcorreta, Botella, and Fernández-Caballero (2008). Association rules were first introduced by Agrawal, Imielinski, and Swami (1993) aiming at discovering consuming patterns in retail databases. Thus, the task of discovering association rules in retail data was termed ‘‘market basket analysis’’. The data stored in a market basket are the items bought on a per-transaction basis in a retailing organization. An association rule may be declared as A → B, where A and B are item sets such that A ⊆ I and B ⊆ I, with ‘‘I’’ being the set of all items. Such a representation states that, in a transaction, the occurrence of all items from ‘‘A’’ (the antecedent side of the rule) results in the occurrence of the items belonging to ‘‘B’’ (the consequent side of the rule).
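To make the notation concrete, the following toy example (our own illustration, not taken from the cited works) computes the two standard measures used to evaluate a rule A → B, support and confidence, over a handful of invented market-basket transactions.

```python
# Sketch: support and confidence of an association rule A -> B (illustrative only).
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "coffee"},
    {"bread", "milk", "coffee"},
]

A = {"bread"}   # antecedent item set
B = {"milk"}    # consequent item set

n = len(transactions)
n_A = sum(1 for t in transactions if A <= t)          # transactions containing A
n_AB = sum(1 for t in transactions if (A | B) <= t)   # transactions containing A and B

support = n_AB / n        # fraction of all transactions containing A and B together
confidence = n_AB / n_A   # fraction of transactions with A that also contain B

print(f"{sorted(A)} -> {sorted(B)}: support={support:.2f}, confidence={confidence:.2f}")
```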
An association rule describes an association relation between item sets that occur together in the transactions of a data set. Thus, association, unlike classification, is not considered a prediction task, because it aims at describing data. Therefore, association rule mining is not considered a supervised learning method. There are several association rule mining algorithms available in the literature, such as ECLAT (Zaki, Parthasarathy, Ogihara, & Li, 1997), DIC (Brin, Motwani, Ullman, & Tsur, 1997) and FP-Growth (Han, Pei, & Yin, 2000). However, Apriori (Agrawal et al., 1993) is certainly the most popular, and most widely employed, association rule discovery algorithm nowadays (Neves, 2003). Its concepts and techniques are used by almost every algorithm proposed currently, most of which are mere extensions of Apriori (Neves, 2003).

Our motivation to apply association rules in recommender systems is based on the structure of these rules, which is very appropriate for recommendation purposes, because they can learn patterns like ‘‘70% of users who liked item 1, also liked item
2". In this context, the rules can denot er on different beings can easier comprehend le algorithms than an output As highli ree data n recommender recommendations. An error in a
2’’. In this context, the rules can denote two types of associations: between users and items (in the case of content-based methods) and between items or between users (in the case of collaborative filtering methods). The first type of association rule in recommender systems consists of associating the active user's profile data with items' features. The second one, employed in collaborative filtering, consists of associating data of the active user with data of other users, or of associating data of items the active user is interested in with data of other items available in the system. In Sun, Kong, and Chen (2005), for example, association rules are applied to mine relationships between items and then make the prediction about an item for an active user by computing the weighted sum of the ratings given by the user for the items similar to the target item. Géry and Haddad (2003) propose the use of implicit opinions (instead of ratings) of users; the problem of finding Web pages visited together is similar to finding association rules among itemsets in transaction databases.

In most cases, rules may be discovered off-line by processing data related to users' opinions. Recommender models based on association rules are easy to interpret and are usually faster to build than most machine learning models. A recommender model using Bayesian networks, for example, considers the conditional probabilities of all the possible opinions for an item given all possible opinions for other items. Moreover, association rules may alleviate the sparsity drawback, which will be described in the next section, because rules do not need to consider all opinions from users in order to build a recommender model. Furthermore, the system may take into account just the best rules for the recommender model. Rules may be evaluated and ranked by means of statistical measures like support and confidence. In the case study depicted in the next section, we will analyze in a more detailed way how association rules can reduce sparsity.

However, Zhang and Chang (2005) affirm that recommender models built employing only association generally present poor-quality recommendations. In Zhang and Chang (2005), association rules were combined with sequential rules to enhance recommendation efficiency, while in Lee et al. (2001) a method combining the nearest neighbor algorithm with association rules was developed. Such a method employs information acquired from transactions of groups of users with similar preferences (neighbors) in order to discover association rules about Web objects. Alternatively, other machine learning methods could be applied to recommender systems, such as the one proposed by Liu, Hsu, and Ma (1998). In that work, the authors adapted association rules in order to play the role of a classifier. Such a method is described in the next subsection and is tried for recommender systems in Section 5.

3.3. Classification based on association methods

As highlighted in the previous subsection, association rules aim at describing data and, consequently, they are seen as an unsupervised learning method. On the other hand, a classification method is seen as a prediction technique, because it aims at predicting the value of an attribute (the label) in a data set. The joining of concepts from classification and association (Liu et al., 1998) is an alternative approach for performing classification tasks, where association rules are employed as the basis of a classification method.
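To illustrate how the two ideas are joined (a minimal sketch over invented data, not the implementation of the cited authors), candidate rules can be generated exactly as in association mining, except that the consequent is fixed to the class attribute, here a binary ‘‘Recommended’’/‘‘Not recommended’’ label; rules below hypothetical minimum support or confidence thresholds are discarded.

```python
# Sketch: mining class-association rules, i.e. rules whose consequent is the class
# attribute only (illustrative; thresholds and records are made up).
from itertools import combinations

# Each record: a set of descriptive items plus one class label.
records = [
    ({"age<25", "likes_comedy"},      "Recommended"),
    ({"age<25", "likes_drama"},       "Recommended"),
    ({"age<25"},                      "Recommended"),
    ({"likes_drama"},                 "Not recommended"),
    ({"likes_comedy", "likes_drama"}, "Not recommended"),
]
MIN_SUPPORT, MIN_CONFIDENCE = 0.2, 0.6

items = sorted({i for itemset, _ in records for i in itemset})
classes = {label for _, label in records}
n = len(records)
rules = []

for size in (1, 2):                              # antecedents of one or two items
    for antecedent in map(set, combinations(items, size)):
        matching = [label for itemset, label in records if antecedent <= itemset]
        if not matching:
            continue
        for label in classes:
            support = matching.count(label) / n               # P(antecedent and class)
            confidence = matching.count(label) / len(matching)  # P(class | antecedent)
            if support >= MIN_SUPPORT and confidence >= MIN_CONFIDENCE:
                rules.append((antecedent, label, support, confidence))

for antecedent, label, sup, conf in rules:
    print(f"{sorted(antecedent)} -> {label} (support={sup:.2f}, confidence={conf:.2f})")
```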
Seeing that association models are commonly more efficient to build than classification models, a crucial matter that encourages the use of association rules in classification is the high computational cost that current classification methods present. Several works (Li, Han, & Pei, 2001; Liu et al., 1998; Thabtah, Cowling, & Peng, 2005; Yin & Han, 2003) verify that classification based on association methods presents higher accuracy than traditional classification methods. Differing from association rules, decision trees, for example, do not consider simultaneous correspondences occurring among different attribute values. Moreover, human beings can more easily comprehend an output provided by association rule algorithms than an output provided by usual classification techniques, such as artificial neural networks (Sarwar et al., 2000). Thabtah et al. (2005) sustain that a few accurate and effective classifiers based on associative classification have been presented recently, such as CBA (Classification Based on Association) (Liu et al., 1998), CPAR (Classification based on Predictive Association Rules) (Yin & Han, 2003), and CMAR (Classification based on Multiple class-Association Rules) (Li et al., 2001).

Taking into account that for classification rule mining there is one and only one predetermined target, while for association rule mining the target of discovery is not pre-determined (Liu et al., 1998), it is necessary to constrain the rules' consequent terms to encompass only one attribute. Thus, the consequent term of an association rule will represent the target, or class, attribute. Therefore, such a rule can play a prediction role in a given system: in order to classify an item, its properties are matched against every rule's antecedent, and the attribute value of the consequent term (from one or more selected rules) will be the predicted class.

Generally, the classification model is a set of ordered rules. In the CBA algorithm, for example, the rules are ordered by means of the confidence measure and only one rule is used for performing classification. However, in this case a scenario may occur in which multiple rules exist with similar confidence measures and, at the same time, with greatly different support measures. Hence, a rule A with slightly higher confidence than a rule B could be the one chosen for classification even if B had a much higher support (Li et al., 2001). The MCAR algorithm (Thabtah et al., 2005) solves such a drawback by means of an approach that considers, in addition to the confidence, the rules' support. The CMAR algorithm has a finer approach for selecting association rules for classification: instead of using just one rule, it makes use of all rules that match the case to be classified. If the consequent term of all selected rules is the same, the predicted class will obviously be the value of the rules' consequent term. Otherwise, rules are divided into groups according to the consequent terms' values, and the value chosen for classification is acquired from the group whose elements hold the highest correlation according to the weighted χ² measure. Similarly to CMAR, the CPAR algorithm also divides rules into groups; however, instead of using all rules that match the object to be predicted, it uses the ‘‘k’’ best rules that represent each class. Afterwards, the algorithm chooses a group, by means of the Laplace accuracy measure, that will be the one used for classification.
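A minimal sketch of the CBA-style strategy just described (our own simplified illustration, reusing the hypothetical rule format of the previous sketch): rules are ordered by confidence and, on ties, by support, and the first rule whose antecedent is contained in the case determines the predicted class, with a default class when no rule fires.

```python
# Sketch: CBA-like classification with an ordered rule list (illustrative only).

# Each rule: (antecedent item set, predicted class, support, confidence).
rules = [
    ({"age<25"},                      "Recommended",     0.60, 1.00),
    ({"likes_drama"},                 "Not recommended", 0.40, 0.67),
    ({"likes_comedy", "likes_drama"}, "Not recommended", 0.20, 1.00),
]

# Order by confidence first and support second (higher values first), as in CBA.
ordered = sorted(rules, key=lambda r: (r[3], r[2]), reverse=True)

def classify(case, ordered_rules, default="Not recommended"):
    """Return the class of the first rule whose antecedent is contained in the case."""
    for antecedent, label, _support, _confidence in ordered_rules:
        if antecedent <= case:
            return label
    return default

print(classify({"age<25", "likes_drama"}, ordered))  # first matching rule -> Recommended
print(classify({"likes_jazz"}, ordered))             # no rule matches -> default class
```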
The drawbacks presented by association rule induction algorithms are, in general, the same as those of classification based on association algorithms. A critical drawback of these algorithms is due to rules that have few attributes: seeing that such rules express narrow information, an object which has few attributes would be ineffectively classified. Another critical drawback is due to the large number of rules that these algorithms commonly produce (Sarwar et al., 2001); as a consequence, many of them do not supply relevant information or are contradictory. Such a drawback is a critical issue for associative classifiers, because the performance of the algorithm may be affected when retrieving, storing, pruning and sorting a large number of rules (Li et al., 2001). The CMAR algorithm tries to solve such a drawback by implementing an FP-tree data structure to store the association rules' frequent itemsets.

4. Recommender systems shortcomings

Shortcomings inherent to the methods employed in recommender systems are reflected in erroneous recommendations. An error in a
recommendation may be presented as a false negative or a false positive. The first one consists of products that were not recommended, though the consumer would have liked them. The second one consists of recommended products, though the consumer does not like them. According to Sarwar et al. (2000), the false positives are more critical because they will lead to angry consumers. Moreover, even early researchers recognized that when recommender approaches are used to support decisions, it can be more valuable to measure how often the system leads its users to wrong choices (Herlocker et al., 2004). Consequently, recommender methods should be concerned mostly with avoiding false positives.

However, Claypool et al. (1999) have suggested that recommender methods, unlike humans, have difficulty in distinguishing between high-quality and low-quality information relating to the same topic. Therefore, it might not be effective to provide recommendations when evaluations acquired from users are not taken into account. On the other hand, the use of information from user evaluations probably provides more false positives. In order to reduce false positives, the methods employed in recommender systems must avoid the occurrence of some typical drawbacks. The next subsections describe the four most critical drawbacks that may occur in these systems.

4.1. Data sparsity

Probably the biggest challenge recommender systems face nowadays is related to the data sparsity problem, due to the huge amount of data available in current recommender systems. Fundamentally, sparsity occurs because the number of ratings needed to build a prediction model is greater than the number of ratings obtained from users. Moreover, most recommender techniques require users' explicit expression of personal preferences for items. Nevertheless, methods for obtaining ratings implicitly have been developed in order to add more ratings and reduce sparsity. However, even with the use of up-to-date methods (including data mining methods), sparsity still remains a critical drawback for recommender systems due to the extensive number of items available. This is a significant problem because, in practice, it is usually costly and difficult to collect sufficient data for all users (Ahn, Kang, & Lee, 2010). According to Sarwar et al. (2001), active users may have purchased less than 1% of the items available in a system. This means that in a recommender system of movies holding 500,000 items, for example, an active user would be able to rate 5000 movies; nevertheless, we cannot expect all users of the system to watch 5000 movies and provide ratings for all of them. In addition, rating schemes can only be applied to homogeneous domains, and the number of ratings eligible to be used by the system is even more restricted.

Model-based methods reduce the shortcomings derived from sparsity; however, it is still necessary to have a certain minimum number of ratings in order to build an estimation model of ratings. In any case, holding ratings for just 1% of the system's items may be, depending on the technique used, too scarce to build a reliable model. In this work we also analyze how the performance of classifiers may be affected by sparsity. Therefore, we address some questions related to sparsity: how we can determine whether a dataset is sparse or not, and whether it is possible to measure the degree of sparsity of a dataset. For practical reasons, industry and academia sometimes evaluate sparsity considering the number of NULL/NA values presented by a certain dataset.
Hence, sparsity may be seen in terms of density, which reflects both the overall size of the recommender's item space and the degree to which users have explored it (Herlocker et al., 2004). In this context, Rittman (2005) describes an example of a dataset with four attributes in a retail market scenario. The attributes are: store, week of the year, customer and product. If there are 1000 stores, 52 weeks in a year, 500,000 customers and 10,000 products, the dataset has 1000 × 52 × 10,000 × 500,000 = 260,000,000,000,000 potential cells. However, it might only have 1,872,000,000 populated cells, because there are 450,000 customers shopping on average 26 times a year, buying 40 products at just 2 stores. So, only about 0.00036% of the dataset is populated ((936,000,000/260,000,000,000,000) × 100).

4.2. Scalability

Scalability is another drawback of recommender systems, resulting from the huge number of items available in them. Scalability in recommender systems includes both very large problem sizes and real-time latency requirements (Schafer et al., 2001). One example of such requirements may be a scenario in which a recommender system connected to a Web site needs to provide recommendations within a few milliseconds and, at the same time, serve thousands of users simultaneously. Thus, a major challenge in recommender systems nowadays, according to Schafer et al. (2001), is to adapt data mining techniques to meet simultaneously low latency and high throughput (the amount of data flowing through a system) requirements in order to attract and serve a huge number of users. In the recommender system context, the throughput may be measured by the number of users and items that the system is able to support without affecting efficiency.

Efficiency is a key feature for recommender systems, because they need to supply fast feedback to their users. Generally, shortcomings resulting from scalability do not occur in model-based methods, because in these methods, differently from other classes of methods, the data processing is usually not done at run time. Scalability may turn into a major concern for the efficiency of a system, because some techniques, like searching for the nearest neighbor, for example, may be unfeasible to employ in systems having a huge database. A typical web-based recommender system running only the nearest neighbor algorithm will probably suffer serious scalability problems (Sarwar et al., 2001).

4.3. Early rater problem

Although the drawbacks described before may be minimized by means of model-based methods, there are other shortcomings that may occur along with these methods. The Early Rater (or First-Rater) Problem (Claypool et al., 1999; Condliff et al., 1999) is an example of a drawback associated with every class of collaborative filtering methods. Such a drawback arises when it is impossible to offer recommendations about an item that was just incorporated into the system and, therefore, has few (or even no) evaluations from users. In fact, the early rater problem is directly linked with sparsity, because when a system has a high number of items, most of these items have probably never received any evaluation. Conceptually, the early-rater problem can be viewed as a special instance of the sparsity problem (Huang, Chen, & Zeng, 2004). Sarwar et al. (2001) affirm that current recommender systems depend on the altruism of a set of users who are willing to rate many items without receiving many recommendations.
Economists have speculated that, even if rating required no effort at all, many users would choose to delay considering items, waiting for their neighbors to provide them with recommendations (Avery & Zeckhauser, 1997). Thus, it is necessary to find a way to encourage users to make evaluations about the items available in the system. Analogously, such a drawback also occurs with a new user joining the system: since there is no information about him, it would be impossible to determine his behavior in order to provide him with recommendations. Actually, this scenario of the early rater
problem is also referred to as the ‘‘Cold-Start problem’’ (Guo, 1997) in the literature. As an extreme case of the early rater problem, when a collaborative filtering system first begins, every user suffers from the early rater problem for every item (Claypool et al., 1999).

4.4. Gray sheep problem

Another drawback occurring only in collaborative filtering methods is the Gray Sheep Problem (Claypool et al., 1999). Such drawback is associated with users whose preferences do not match those of any group of users. As a consequence, these users will not receive any recommendation. In fact, a user who stays a long time in the cold-start condition may be considered a gray sheep as well, because such a user has not shown interest in the system's items. Thus, conceptually, the gray sheep problem can be viewed as a special instance of the early-rater problem.

Given that in content-based methods the preferences of other users are not considered for building recommendations, the gray sheep problem does not occur in this class of methods. The early rater problem does not occur in such methods either, because they may provide recommendations taking into account merely the features of an item.

On the other hand, since content-based methods do not consider the social context of the user, the system may only be capable of recommending items that score highly against a user profile (Condliff et al., 1999); therefore, the user will only get to see items that are similar to those he or she has rated positively. As a result, many false negatives may occur, because such methods are not able to distinguish between low and high quality information related to the same domain. Moreover, in some domains the items are not amenable to any useful feature extraction with current technology (such as movies, music and restaurants) (Balabanović & Shoham, 1997). In web pages, for instance, only some features of the content can be extracted, because current information retrieval techniques ignore multimedia features such as text embedded in images (Balabanović & Shoham, 1997).

5. Case study

In this section we describe a case study that aims at investigating algorithms and design choices that may help to alleviate the most common and challenging drawbacks inherent to recommender systems. To do so, we analyze the behavior of some classification algorithms on general data and also on real data gathered from recommender systems. In order to evaluate such algorithms, we consider, besides the classic precision rate metric, some important metrics in the recommender system context (i.e., the false positive rate and the number of rules considered for classification). Hill, Stead, Rosenstein, and Furnas (1995) suggested that algorithmic improvements in collaborative filtering methods may come from different directions than just continued improvements in mean absolute error. Moreover, though new algorithms often appear to do better than the older algorithms they are compared to, when each algorithm is tuned to its optimum they all produce similar measures of quality (Herlocker et al., 2004). That is why we consider, in addition to numerical metrics, some implicit non-numerical features for which we also rely on empirical knowledge. Consequently, we are able to analyze recommender techniques from different perspectives instead of being restricted to evaluating methods simply by comparing classic accuracy metrics.
In order to build the experiments performed in this case study, we considered the 26 datasets tested in the case studies accomplished in Liu et al. (1998) and Li et al. (2001), and we also employed five datasets obtained from real recommender systems. For all datasets, the WEKA (Waikato Environment for Knowledge Analysis) tool was used to perform data transformation and pre-processing, and we applied the 10-fold cross-validation method to estimate the algorithms' classification metrics in all experiments.

In the next subsection we describe all datasets employed in this case study, as well as how they were obtained. Subsequently, in Section 5.2, we compare some typical rule-based classifiers against some general classifiers, taking into account their behavior with respect to the drawbacks depicted in Section 4. In Sections 5.3 and 5.4 we analyze how rule-based algorithms and recommender systems data are related to sparsity (the most critical drawback of recommender systems). Finally, in Section 5.5 we probe the occurrence of false positive errors on three different groups of classifiers.

5.1. Employed datasets

In order to analyze sparsity on rule-based classifiers (Section 5.3), we considered the 26 datasets employed by the works proposing the CBA (Liu et al., 1998) and CMAR (Li et al., 2001) algorithms, which are general data from different domains gathered from the UCI Machine Learning Repository. However, at this point we just compared the results obtained by the authors on each dataset, where we considered the number of attributes, the number of records and the accuracy obtained by the classifiers.

For the rest of the experiments performed in this case study, we employed real data obtained from two recommender systems: MovieLens and BookCrossing. The MovieLens data consists of movie ratings made by users in 2000; MovieLens is a recommender system based on the GroupLens technology, and its data is freely available for research purposes. Initially, the MovieLens dataset contained approximately 100,000 ratings for 1682 movies made by 943 users, but we integrated the data related to users and movies into one file, which was the input provided to the algorithms analyzed in this case study.

However, before supplying such input we changed the rating attribute in order to have only two values: ‘‘Not recommended’’ (score 1 or 2) and ‘‘Recommended’’ (score 3, 4 or 5). The former refers to an item the user is probably not interested in, and the latter to an item the user may be interested in. Such changes were performed to simplify classification, because the main aim in a recommendation task is to determine whether an item should be offered to the user or not. Taking into account users' data, we used the following attributes: gender, age and occupation. The age attribute was discretized into five age ranges. The user's occupation attribute is a nominal variable with 21 distinct values. Taking into account movies' data, the file provided by MovieLens originally contained 19 binary attributes related to movie genres. A value of 1 expressed that the movie belongs to a specific genre and 0 otherwise. The association model's consistency would be compromised if 19 of the 23 attributes in the dataset were binary. Thus, these 19 binary attributes were reduced to just one attribute representing the movie genre's name. However, since some movies may belong to different film genres, we only used the records containing ratings about movies with just one genre.
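The MovieLens preparation just described (rating binarization, age discretization and the collapse of the 19 genre flags into one nominal attribute) can be sketched as follows in Python with pandas. The column names, toy values and age cut points are illustrative assumptions; the authors performed these steps in WEKA, so this is only a minimal equivalent, not their actual pipeline.

import pandas as pd

# Toy stand-ins for the MovieLens user, movie and rating tables (hypothetical values).
users = pd.DataFrame({"user_id": [1, 2], "gender": ["M", "F"],
                      "age": [28, 45], "occupation": ["engineer", "teacher"]})
genre_flags = ["Action", "Comedy", "Drama"]           # the real file has 19 such flags
movies = pd.DataFrame({"movie_id": [10, 20, 30],
                       "Action": [1, 0, 0], "Comedy": [0, 1, 1], "Drama": [0, 0, 1]})
ratings = pd.DataFrame({"user_id": [1, 1, 2], "movie_id": [10, 30, 20],
                        "score": [4, 2, 5]})

# 1. Binarize the score: 1-2 -> "Not recommended", 3-5 -> "Recommended".
ratings["rating"] = ratings["score"].map(
    lambda s: "Recommended" if s >= 3 else "Not recommended")

# 2. Discretize age into five ranges (the cut points are an assumption).
users["age_range"] = pd.cut(users["age"], bins=[0, 18, 25, 34, 49, 120],
                            labels=["<18", "18-24", "25-34", "35-49", "50+"])

# 3. Keep only single-genre movies and collapse the 0/1 flags into one nominal attribute.
single_genre = movies[movies[genre_flags].sum(axis=1) == 1].copy()
single_genre["genre"] = single_genre[genre_flags].idxmax(axis=1)

# 4. Merge users, movies and ratings into the single input table for the classifiers.
data = (ratings.merge(users, on="user_id")
               .merge(single_genre[["movie_id", "genre"]], on="movie_id")
               [["gender", "age_range", "occupation", "genre", "rating"]])
print(data)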
After data pre-processing and transformation, 14,587 records remained in the input file for the algorithms used in this study.

On the other hand, the BookCrossing database consists of book ratings gathered by Ziegler, McNee, Konstan, and Lausen (2005) from the BookCrossing community. Users from this community exchange books and experiences all around the world. Initially, the BookCrossing data contained 433,671 explicit ratings (an assigned mark from 1 to 10) about 185,832 books provided by 77,797 users. Such a database has 2.33 ratings per item; for this reason, it may be considered much sparser than MovieLens (which has 59.45 ratings per item). This assumption makes the use of the
BookCrossing data even more interesting and innovative for testing recommender methods. In this context, Herlocker et al. (2004) verified that different algorithms designed for recommender systems may be better or worse on different datasets and, in addition, many algorithms have been designed specifically for datasets having many more users than items. As a consequence, such algorithms may present a completely different performance when applied to datasets that do not have an abundant quantity of users per item or ratings per item. Therefore, testing recommender algorithms only on the MovieLens data might not yield trustworthy evaluations of the algorithms' use in general recommender systems. However, as the BookCrossing data had not been used before for testing recommender algorithms, some effort was required to prepare the data to be supplied as input to the algorithms. In the rest of this subsection, we describe the main approaches employed for performing data pre-processing and transformation.

In order to simplify the classification, the rating attribute of the BookCrossing database was modified in the same way as the MovieLens one: ‘‘Not recommended’’ (score from 1 to 6) and ‘‘Recommended’’ (score from 7 to 10). For books' data, we used two attributes from the dataset: publication year and author. The first was discretized into five ranges. The Author attribute was also modified, because at first it encompassed 48,234 distinct values. Thus, the dataset was reduced so that this attribute encompasses only 40 distinct values (the ones that appear in more records). Taking into account users' data, we also used two attributes: age and the place where the user lives. The first was discretized into nine age ranges. The Place attribute originally contained the name of the city, the state or province, and the name of the country. However, in this form the attribute presented 12,952 distinct values. Therefore, we changed this attribute so that it encompasses only 40 distinct values. For that reason, and noticing that 75% of the places were from the USA, we divided the dataset, based on this attribute, into two: places grouped by states of the USA and places grouped by countries excepting the USA. Afterwards, the first dataset (states of the USA) remained with 25,523 records and the second one (countries) remained with 8926 records. In order to perform the case study on even more diverse data, we also used two more datasets, having a smaller range of distinct values, derived from those mentioned before. To do so, we copied both datasets and kept only 10 distinct values (the most frequent) for the Author and Country/State attributes. This way, we obtained two more datasets that contain 6270 records (on the dataset of states of the USA) and 3238 records (on the dataset of countries).

5.2. Associative classifiers vs. general classification algorithms

In this subsection we describe some experiments intended to test classification algorithms, especially associative classifiers, on real recommender systems data. To do so, we compared three associative classifiers (CBA, CPAR and CMAR) with two traditional classifiers (Bayes Net and C4.5), where Bayes Net is a probabilistic algorithm and C4.5 a decision-tree-based algorithm. The first two were run through WEKA and the other three were obtained from the LUCS-KDD Software Library, from the University of Liverpool.
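As a rough illustration of the evaluation protocol (accuracy estimated by cross-validation over the merged categorical table), the sketch below uses scikit-learn stand-ins: a CART decision tree in place of C4.5 and naive Bayes in place of a Bayes network, since neither C4.5 nor the associative classifiers are available there. The toy table, its column names and the one-hot encoding are assumptions; the actual experiments were run in WEKA and with the LUCS-KDD implementations, using 10 folds.

import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Tiny illustrative table with the attribute layout used for the MovieLens input.
data = pd.DataFrame({
    "gender":     ["M", "F", "M", "F", "M", "F", "M", "F", "M", "F"],
    "age_range":  ["25-34", "18-24", "25-34", "35-49", "50+",
                   "25-34", "18-24", "25-34", "35-49", "25-34"],
    "occupation": ["engineer", "teacher", "artist", "engineer", "teacher",
                   "artist", "engineer", "teacher", "artist", "engineer"],
    "genre":      ["drama", "comedy", "drama", "action", "comedy",
                   "drama", "action", "drama", "comedy", "drama"],
    "rating":     ["Recommended", "Not recommended", "Recommended", "Recommended",
                   "Not recommended", "Recommended", "Not recommended",
                   "Recommended", "Recommended", "Recommended"],
})

# One-hot encode the nominal attributes so both classifiers accept them.
X = OneHotEncoder().fit_transform(data.drop(columns="rating"))
y = data["rating"]

# cv=3 only because the toy table is tiny; the case study used 10-fold cross-validation.
for name, clf in [("decision tree (CART, C4.5-like)", DecisionTreeClassifier(random_state=0)),
                  ("naive Bayes (Bayes Net stand-in)", BernoulliNB())]:
    scores = cross_val_score(clf, X, y, cv=3)
    print(name, round(float(scores.mean()), 2))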
Bayes Net and C4.5 were chosen because, in addition to being classification methods widely employed in recommender systems, they represent two groups of machine learning methods: probabilistic classification (Bayes Net) and rule-based classification (C4.5). On the other hand, besides the existence of few accurate and effective associative classifiers, there are even fewer implementations available. In fact, the LUCS-KDD Software Library was the only software repository we found offering associative classifiers for free use. Therefore, we only employed three classification based on association algorithms in this case study. CBA was chosen because it was the first associative classifier built and its concepts are present in most current associative classifiers. Conversely, CMAR was chosen because it represents a modern generation of these classifiers: it employs an alternative data structure for storing classification rules, which, according to Li et al. (2001), makes it possible to reduce processing time and, in some cases, increase precision. Finally, CPAR was chosen because it is another popular associative classifier proposed more recently.

In order to perform the experiments, we defined a very low support threshold value (1%) for running the classification based on association algorithms, so as to obtain enough frequent itemsets. Conversely, we defined high confidence threshold values: 85% to apply CBA and CPAR and 70% to apply CMAR. We reduced the confidence threshold to apply CMAR because the data structure it employs, an FP-Tree (Frequent Pattern Tree), stores frequent itemsets in a compact way in which common relations between itemsets are exploited. In this way, items need to be frequent enough to be stored in the FP-Tree and to be considered in the first place. For the BookCrossing datasets containing 10 distinct values for the Author and Country/State attributes, we increased the support threshold to 5% due to their reduced number of records.

In Table 1 we show the results obtained after running the algorithms mentioned above. Each line depicts the accuracy obtained by each classifier, which is defined as the percentage of correctly classified samples among all the data taken into account. Results revealed that the associative classifiers reached accuracy similar to that of the traditional (supervised learning) classifiers, except for CMAR on the BookCrossing data. Actually, in some cases the associative classifiers reached higher accuracy. Despite being the first method of classification based on association, the CBA algorithm reached the highest accuracy on two of the four BookCrossing datasets. On the MovieLens data, CMAR reached the highest accuracy, which was the best result obtained over all the experiments. Since the rules provided by the associative classifiers hold a high confidence value (equal to or greater than 70% or 85%), the rules used for building the classification models are reliable. The ninth rule generated by CMAR on the MovieLens data is an example of this kind of rule: ‘‘age = [25, 34] & genre = ‘drama’ → rating = ‘yes’’’. Such rule states that, if a user is older than 25 years and younger than 34 years old, he will probably rate a drama movie positively. Despite presenting the highest precision over all experiments (85.16% on MovieLens data), CMAR presented lower precision than the other classifiers on the BookCrossing datasets.
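To make the support and confidence thresholds concrete, the following sketch computes both measures for a rule of the kind shown above on a toy table; the values and the reconstructed age interval are illustrative assumptions. An associative classifier keeps a rule only if both measures pass the configured thresholds.

import pandas as pd

# Toy MovieLens-style table (illustrative values only).
data = pd.DataFrame({
    "age_range": ["25-34", "25-34", "25-34", "18-24", "25-34"],
    "genre":     ["drama", "drama", "comedy", "drama", "drama"],
    "rating":    ["Recommended", "Recommended", "Not recommended",
                  "Recommended", "Not recommended"],
})

# Rule: age_range = 25-34 & genre = drama  ->  rating = Recommended
antecedent = (data["age_range"] == "25-34") & (data["genre"] == "drama")
whole_rule = antecedent & (data["rating"] == "Recommended")

support = whole_rule.mean()                       # fraction of all records matching the whole rule
confidence = whole_rule.sum() / antecedent.sum()  # fraction of antecedent matches with that class

print(f"support = {support:.2f}, confidence = {confidence:.2f}")
# CBA/CPAR would keep such a rule only if support >= 1% and confidence >= 85%;
# CMAR uses the lower 70% confidence threshold discussed above.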
Table 1
Comparison of classifiers.

Data                 Bayes Net (%)   C4.5 (%)   CBA (%)   CPAR (%)   CMAR (%)
MovieLens            81.95           82.88      81.4      74.07      85.16
BCrossing World      80.87           80.21      79.47     73.25      70.43
BCrossing World 10   80.51           79.98      80.52     79.86      41.41
BCrossing USA        80.23           81.31      80.28     78.15      77.66
BCrossing USA 10     81.53           80.82      81.56     76.71      69.59

MovieLens and BookCrossing data basically differ in the number of distinct values of their attributes: the MovieLens attributes have, in general, fewer distinct values than those of the BookCrossing datasets when the ratio of records to number of distinct values is taken into consideration. Moreover, MovieLens has a ratio of 59.45 ratings per item, which is considerably greater than the ratio of BookCrossing (2.33 ratings per item).

Taking into account the datasets of BookCrossing, when comparing the datasets of ‘‘world countries’’ with the datasets of
‘‘states of USA’’, we noticed a slight loss of precision, in most scenarios, on the dataset of world countries. Such loss is reasonable due to the diversity of characteristics that readers from different world countries (from four continents) may have, which is probably much more substantial than that of people from just one country. Comparing the datasets having 40 distinct values (on the Country/State and Author attributes) with the datasets having 10 distinct values, we noticed that there was not a significant difference of precision among them with the confidence and support threshold values that were set.

Finally, the CPAR algorithm also presented acceptable results, even though its precision was slightly lower than that of the other classifiers. Such algorithm is more effective in scenarios of very large datasets where processing time may be a critical issue, because the classifier construction and the rule induction are made in just one processing step. However, in the context of recommender systems, the response time to the user is not a critical issue, because the recommender model is built off-line.

Through these experiments we were able to conclude that classification based on association methods can be employed as effectively in recommender systems as other methods are, because they can reach precision similar to, or even greater than, that of traditional classifiers, and also because the rules obtained by the classification model are reliable due to the high confidence values they present. Since the estimation model is built off-line, recommendations may be provided in real time without significant processing effort, and scalability, which is usually a challenging requirement for recommender systems, is not a major design concern as it is for on-line recommender methods. Moreover, as rule-based classifiers (especially associative classifiers) are composed of simple classification rules, the estimation model is easy to interpret. It allows the analyst to inspect, and interfere with, the classification model generated. Another advantage of a rule-based recommender model is that it is not static, since there is no need to build a model for every user (there are general rules for all users). This way, the gray sheep problem would be less likely to occur using associative classifiers as well.

Subsequently, we verify, by means of several datasets with different features and from different domains, how vulnerable rule-based classifiers are to sparsity. In this way, we will be able to verify whether there are even more advantages, for alleviating typical shortcomings, in employing rule-based classifiers in recommender systems.

5.3. Analyzing sparsity on rule-based classifiers

In this section we describe a case study that evaluates how sparsity affects the accuracy of some rule-based classifiers. Therefore, we took into account the case studies accomplished in CBA's (Liu et al., 1998), CPAR's (Yin & Han, 2003) and CMAR's (Li et al., 2001) papers. Such papers compared the accuracy rate obtained by their algorithms against the Ripper (a traditional rule-based classifier) and C4.5 algorithms. For their experiments running such algorithms, they employed twenty-six datasets, which are general data from different domains gathered from the UCI Machine Learning Repository.
In such analysis, for evaluating associative classifiers we took into account CBA and CMAR, and for evaluating non-associative rule-based algorithms we considered the same algorithms (C4.5 and Ripper) analyzed in the case studies accomplished in CBA's, CPAR's and CMAR's papers. We did not consider the CPAR algorithm for evaluating sparsity because, as highlighted in the previous subsection, the other two associative classifiers (CBA and CMAR) are more convenient to be employed in recommender systems.

In order to consider sparsity, in this analysis we used a measure based on the relationship between the number of records and attributes. Such correlation was built taking into account the approach shown in Rittman (2005), which was described in detail in Section 4.1. A high value of such correlation suggests a dense (non-sparse) dataset (a high number of records compared with the number of distinct values of its attributes), because it is easier to have frequent itemsets if there are more records available.

In Table 2 we analyze the performance of C4.5, Ripper, CBA and CMAR on sparse datasets. To do so, we selected, among the 26 datasets analyzed, the four sparsest datasets (Sonar, Labor, Zoo and Hepatic) for running the algorithms. The first four lines of the table depict the accuracy obtained by the classifiers and the last line depicts the correlation between the number of records and attributes.

Table 2 shows that the associative classifiers, CBA and CMAR, obtained better results than the other rule-based classifiers (C4.5 and Ripper): on the four datasets they reached the highest accuracy (in bold) and, except on the Sonar dataset, both reached greater accuracy than C4.5 and Ripper.

In order to compare the associative classifiers to the other algorithms, we show, in every subsequent table, the four datasets corresponding to the best results reached by each algorithm. For each of the four algorithms analyzed, we show the four datasets corresponding to its best results when compared to the other algorithms. In this way, we may know which algorithms are more susceptible to sparsity, because we can analyze the density of the datasets on which every algorithm got its best results.

Firstly, we verified the best results reached by CBA in comparison to the non-associative classifiers (C4.5 and Ripper). Such comparison was done through a variation metric considering the difference of accuracy between CBA and C4.5 plus the difference of accuracy between CBA and Ripper. The general formula for such metric is:

Metric_i = (accuracy of algorithm_1 - accuracy of algorithm_2) + (accuracy of algorithm_1 - accuracy of algorithm_3)

Taking into account the first column of Table 3, for example, we can check that the accuracy of CBA is 96.8%, of C4.5 is 92.2% and of Ripper is 88.1%. Consequently, the variation metric in this case is 13.3% (4.6% + 8.7%). Table 3 shows the best results of CBA, where the first three lines depict the accuracy obtained by the classifiers and the last two lines depict, respectively, the variation metric defined previously and the correlation between the number of records and attributes.

Table 2
Algorithms' precision on the sparsest datasets.

                     Sonar           Labor          Zoo             Hepatic
C4.5                 70.2%           79.3%          92.2%           80.5%
Ripper               78.4%           84.0%          88.1%           76.7%
CBA                  77.5%           86.3%          96.8%           81.8%
CMAR                 79.4%           89.7%          97.1%           80.6%
Records/attributes   208/60 = 3.47   57/16 = 3.56   101/16 = 6.31   155/19 = 8.16

Table 3
CBA's best results.
                     Zoo             Glass           Labor          Sonar
CBA                  96.8%           73.9%           86.3%          77.5%
C4.5                 92.2%           68.7%           79.3%          70.2%
Ripper               88.1%           69.1%           84.0%          78.4%
Metric_1             13.3%           10.0%           9.3%           6.4%
Records/attributes   101/16 = 6.31   214/9 = 23.78   57/16 = 3.56   208/60 = 3.47
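The variation metric can be reproduced directly from the figures in Table 3; the short sketch below (plain Python, no external dependencies) recomputes Metric_1 for the four columns and matches the values reported in the table.

# Variation metric from Section 5.3: the accuracy gain of the reference algorithm
# over each of the two non-associative classifiers, summed.
def variation_metric(acc_ref, acc_other1, acc_other2):
    return (acc_ref - acc_other1) + (acc_ref - acc_other2)

# Accuracy figures taken from Table 3 (CBA's best results): (CBA, C4.5, Ripper).
table3 = {"Zoo":   (96.8, 92.2, 88.1),
          "Glass": (73.9, 68.7, 69.1),
          "Labor": (86.3, 79.3, 84.0),
          "Sonar": (77.5, 70.2, 78.4)}

for dataset, (cba, c45, ripper) in table3.items():
    print(f"{dataset}: Metric_1 = {variation_metric(cba, c45, ripper):.1f}%")
# Zoo: 13.3%, Glass: 10.0%, Labor: 9.3%, Sonar: 6.4% -- matching the Metric_1 row of Table 3.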
Through the results shown in Table 3, we can conclude that the non-associative classifiers are much more susceptible to sparsity than CBA, because the four datasets on which Metric_1 presented the greatest values are quite sparse (the records/attributes correlation is, except for the Glass dataset, lower than 10).

Analogously, we also analyzed the four datasets on which CMAR presented the greatest variation of accuracy (i.e., Metric_2) when compared to C4.5 and Ripper. Table 4 shows such results. Results revealed that the non-associative classifiers are also much more susceptible to sparsity than CMAR, because the first three datasets are quite sparse. However, CBA appeared to be less susceptible to sparsity than CMAR, because the values of the records/attributes correlation are greater in the results of CMAR (the Waveform dataset, for example, is quite dense).

Conversely, we also compared C4.5 and Ripper to the associative classifiers. Table 5 shows the best results of C4.5, where Metric_3 is the difference of accuracy between C4.5 and CBA plus the difference of accuracy between C4.5 and CMAR. Results shown in Table 5 revealed that the datasets on which C4.5 reached its greatest accuracies are very dense (except for the Auto dataset). Analogously, the results in Table 6 (Ripper's best results) show that Ripper presented results similar to C4.5's. Thus, the results suggest that non-associative classifiers are much more susceptible to sparsity than associative classifiers. Moreover, the values of the variation metric for C4.5 and Ripper are notably lower than the ones of CBA and CMAR, which reveals that the advantage the non-associative classifiers present in their best cases is smaller.

5.4. Analyzing sparsity on recommender systems data

In this subsection, we continue analyzing the effects caused by sparsity. However, at this point we are concerned specifically with recommender systems, because we want to investigate methods that are capable of alleviating the effects of sparsity in recommender systems. Furthermore, taking into account the results of the experiments made in the previous subsection, in the experiments described subsequently we considered specifically the associative classifiers (CBA and CMAR). In this way, we are able to depict a more accurate and less redundant case study.

Before performing the experiments, we analyzed the sparsity features of the five datasets acquired from MovieLens and BookCrossing, which were described in Section 5.1. To do so, we also considered the approach taken in Rittman (2005), which was described in detail in Section 4.1. At this point, we also take into account the number of distinct values of the attributes, because the investigated datasets have similar features.

Hence, we first take into account the MovieLens dataset (with 14,587 records) and calculate the product of the distinct values of its attributes: 2 × 2 × 7 × 21 × 29 = 17,052. For BookCrossing, the product of the distinct values for both the ‘‘Country’’ and ‘‘State’’ datasets is 144,000 (2 × 40 × 5 × 40 × 9), and the product of the distinct values for the reduced datasets is 9000 (2 × 10 × 5 × 10 × 9). Table 7 shows the density correlation (considering the approach taken in Rittman (2005)) of the datasets used in the experiments made in this subsection.

As shown in Table 7, ‘‘BCrossing World’’ and ‘‘BCrossing USA’’ are the sparsest datasets. Not surprisingly, the two datasets with a reduced number of distinct values are less sparse. Conversely, MovieLens is the densest dataset used in this study (0.86).
For this reason, the MovieLens dataset is used in most recommender systems works: its density makes it easier to develop trustworthy case studies.

The datasets shown in Table 7 were provided as input to the CBA and CMAR algorithms, for which we set a confidence threshold value of 75%. The support threshold values were set empirically, according to each dataset and algorithm, in a way that allowed us to obtain at least 10 rules in the classification model. In Table 8 we show the results obtained by CBA, where we also depict the number of rules built for the classification model. Analogously, in Table 9 we show the results obtained by the CMAR classifier.

Results revealed that the associative classifiers reached similar accuracy for MovieLens (the densest dataset), where CMAR presented better accuracy (75.63%) and the greatest support threshold within its results. However, for sparser datasets, CMAR presented worse results: as shown in Table 7, the datasets of countries are sparser than the datasets of states of the USA and, as shown in Table 9, there was a great loss of accuracy for ‘‘BCrossing World’’ compared with ‘‘BCrossing USA’’ and for ‘‘BCrossing World 10’’ compared with ‘‘BCrossing USA 10’’. On the other hand, as shown in Table 8, CBA did not present a great loss (less than 1%) of accuracy for the datasets of countries.

A similar scenario occurred for the datasets with 10 distinct values compared with the ones with 40 distinct values: CMAR presented a loss of accuracy on the former. Such datasets have substantially fewer records (around 75% fewer) than the ones with 40 distinct values, which means they are less likely to present frequent itemsets and to yield relationships from which rules can be built. Conversely, CBA did not lose accuracy on these datasets.

Table 4
CMAR's best results.

                     Labor          Zoo             Lymph           Waveform
CMAR                 89.7%          97.1%           83.1%           79.4%
C4.5                 79.3%          92.2%           73.5%           70.2%
Ripper               84.0%          88.1%           79.0%           78.4%
Metric_2             16.1%          13.9%           13.7%           12.3%
Records/attributes   57/16 = 3.56   101/16 = 6.31   148/18 = 8.22   5000/21 = 238.1

Table 5
C4.5's best results.

                     Vehicle       Auto           Pima         Led7
C4.5                 72.6%         80.1%          75.5%        73.5%
CBA                  68.7%         78.3%          72.9%        71.9%
CMAR                 68.8%         78.1%          75.1%        72.5%
Metric_3             7.7%          3.8%           3.0%         2.6%
Records/attributes   846/18 = 47   205/25 = 8.2   768/8 = 96   3200/7 = 457.1

Table 6
Ripper's best results.

                     Horse           Austral         Sick             Hypo
Ripper               84.8%           87.3%           97.7%            98.9%
CBA                  82.1%           84.9%           97.0%            98.9%
CMAR                 82.6%           86.1%           97.5%            98.4%
Variation            4.9%            3.6%            0.9%             0.5%
Records/attributes   368/22 = 16.7   690/14 = 49.3   2800/29 = 96.5   3163/25 = 126.5

Table 7
Datasets' density.

Dataset              Number of records/product of distinct values
BCrossing USA        25,523/144,000 = 0.18
BCrossing USA 10     6270/9000 = 0.7
BCrossing World      8926/144,000 = 0.062
BCrossing World 10   3228/9000 = 0.36
MovieLens            14,587/17,052 = 0.86
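The density measure used above and in Table 7 is simply the number of records divided by the product of the distinct-value counts of the attributes. Below is a minimal sketch, assuming the data sits in a pandas DataFrame whose columns are the nominal attributes; the toy table is illustrative only.

from math import prod
import pandas as pd

def density(df: pd.DataFrame) -> float:
    """Records divided by the product of the attributes' distinct-value counts
    (the sparsity measure of Section 5.4; higher means denser)."""
    return len(df) / prod(df[col].nunique() for col in df.columns)

# Toy check on a tiny table.
toy = pd.DataFrame({"genre": ["drama", "comedy", "drama", "drama"],
                    "rating": ["yes", "no", "yes", "yes"]})
print(density(toy))               # 4 records / (2 * 2 distinct values) = 1.0

# The MovieLens figure of Table 7, using the product of distinct values reported in the text:
print(round(14_587 / 17_052, 2))  # 0.86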
Therefore, we can conclude that the CBA algorithm is more appropriate than CMAR to be applied in recommender systems. The results have shown that CMAR is less effective on sparser datasets (more distinct values or fewer records). This may be justified by the data structure it employs, the FP-Tree (Frequent Pattern Tree). Such structure stores frequent itemsets in a compact way in which common relations between itemsets are exploited; on sparse datasets, items stored in the FP-Tree were not frequent enough to generate rules. Besides, usually every new associative classifier paper uses CBA as a reference to make comparative studies and to validate its algorithm. The main contribution of recent associative classifiers is related to memory usage and processing time because, in general, accuracy is higher only on some datasets and such increment is not very substantial.

Subsequently, we examine, by means of the occurrence of false positives in the algorithms, further advantages of employing associative classifiers in recommender systems.

5.5. Analyzing false positives occurrence

According to what was described in Section 4, designers of recommender methods should have a major concern with avoiding false positive errors. In this context, in this subsection we describe some experiments we made in order to compare the false positive rates of associative classifiers and traditional classifiers. To do so, we chose three algorithms: Bayes Net, C4.5 and CBA. The first two represent two different classes of machine learning methods largely employed in recommender systems, probabilistic classification and rule-based classification, respectively. The third one, CBA, represents associative classifiers and was chosen basically because it presented considerably greater precision than the other two associative classifiers when tested on recommender systems data, especially on the BookCrossing datasets.

In order to calculate the false positive rate of a given class c1, we employed the approach defined by Fawcett (2003), which is stated as follows:

FP rate = negatives incorrectly classified / total negatives

The false positive rate is also called the ‘‘false alarm rate’’, because it counts negative instances (samples not belonging to the class c1 being analyzed) that were incorrectly classified. On the other hand, the true positives represent the instances belonging to c1 that were correctly classified (also called hits). An ideal ‘‘conservative classifier’’ is one which classifies positive instances (the ones belonging to c1) only with strong evidence; as a consequence, it makes few false positive errors, but its true positive rate is reduced as well, and therefore the precision is also reduced. Nevertheless, conservative classifiers are appropriate for a recommender system scenario, in which false positives need to be avoided.

It should be noted that, generally, associative classifiers do not consider a default rule to classify an instance that does not match any rule generated, as would be done by other traditional classifiers. Such a default rule could indeed lead to recommending an item that does not match the user's needs. This way, an associative classifier does not classify a user whose data does not match any rule generated. Conversely, other classifiers always classify the active user, as they classify every sample provided as input.
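The false positive rate defined above can be sketched in a few lines; the example also shows that an instance left unclassified by an associative classifier (no matching rule, marked None here) does not count as a false positive. The labels and predictions are illustrative assumptions.

# False positive rate (Fawcett, 2003): among the instances that do NOT belong to
# the class of interest, the fraction the classifier assigns to that class.
def fp_rate(y_true, y_pred, positive="Recommended"):
    negatives = [i for i, label in enumerate(y_true) if label != positive]
    false_positives = [i for i in negatives if y_pred[i] == positive]
    return len(false_positives) / len(negatives)

y_true = ["Recommended", "Not recommended", "Not recommended",
          "Recommended", "Not recommended"]
# None marks an instance that matched no rule and was therefore left unclassified.
y_pred = ["Recommended", "Recommended", None,
          "Recommended", "Not recommended"]

print(round(fp_rate(y_true, y_pred), 2))  # 1 false alarm out of 3 true negatives -> 0.33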
Table 10 shows the false positive rates obtained by Bayes Net, C4.5 and CBA on the same datasets employed in the previous subsection, where we also set the same confidence and support threshold values.

Table 10 shows that Bayes Net presented the greatest false positive rates among all the algorithms studied. This suggests that Bayes Net is more susceptible to the occurrence of false positive errors than rule-based classifiers. Taking into account the CBA algorithm, it obtained, on average, a false positive rate about 10% lower than that of C4.5, which in turn was lower than that of Bayes Net. In addition, the false positive rate obtained by CBA for the ‘‘BCrossing USA’’ dataset was 21.12% lower than that of C4.5, and the reduction was even greater in comparison with Bayes Net. This scenario shows that the use of associative classifiers is very appropriate for avoiding the occurrence of false positives in recommender systems, given that they were dramatically reduced using the CBA algorithm.

Table 8
Results of CBA.

Dataset              Accuracy (%)   Num. rules   Support (%)   Density
BCrossing USA        80.24          15.9         5             0.18
BCrossing USA 10     80.56          12.3         10            0.7
BCrossing World      79.43          17.7         5             0.062
BCrossing World 10   79.26          10.9         10            0.36
MovieLens            72.77          10.3         15            0.86

Table 9
Results of CMAR.

Dataset              Accuracy (%)   Num. rules   Support (%)   Density
BCrossing USA        71.26          18.7         3             0.18
BCrossing USA 10     69.58          16           5             0.7
BCrossing World      61.7           17.1         2             0.062
BCrossing World 10   42.34          11.3         3             0.36
MovieLens            75.63          11.7         10            0.86

Table 10
False positive rates.

Dataset              Bayes Net (%)   C4.5 (%)   CBA (%)
MovieLens            47.4            42.6       33.22
BCrossing World      45.15           39.9       23.03
BCrossing World 10   40.3            42.45      33.83
BCrossing USA        47.45           41.55      20.43
BCrossing USA 10     48.25           44         34.66

6. Conclusion

In this work we focused on revealing how the most typical and critical drawbacks of recommender systems may be avoided and alleviated. To do so, we tried some machine learning methods, especially classification based on association, on diverse types of data. We also used data gathered from real recommender systems, including BookCrossing, which is a database concerning a domain that had never been used before in case studies on recommender systems. According to the results obtained in the first steps of the case study developed in Section 5, we focused this work on evaluating classification based on association.

There are several advantages in employing classification based on association algorithms in recommender systems. First, the model is built off-line and, consequently, processing time and scalability are not a major concern in recommender systems employing associative classifiers. Likewise, sparsity may be alleviated as well, because such classifiers proved to be (according to the experiments made in Sections 5.2 and 5.3) less susceptible to a radical loss of precision on sparse datasets. Since a recommender model using associative classification has general rules considering several users' attributes, the gray sheep problem is also likely to be alleviated. In addition, the analyst may set up the input parameters of the algorithm in order to generate more rules that would likely be suitable for more users. Furthermore, the number of false positives was reduced dramatically when we tested the CBA algorithm (Section 5.5).

By means of the experiments described in Section 5.4, we argue that CBA is more effective than CMAR on sparser datasets. Thus,