… BookCrossing data even more interesting and innovative for testing recommender methods. In this context, Herlocker et al. (2004) verified that different algorithms designed for recommender systems may be better or worse on different datasets; in addition, many algorithms have been designed specifically for datasets having many more users than items. As a consequence, such algorithms may present a completely different performance when applied to datasets that do not have an abundant quantity of users per item or ratings per item. Therefore, testing recommender algorithms only on the MovieLens data might not yield trustworthy evaluations of how these algorithms behave in general recommender systems.

However, as the BookCrossing data had not been used before for testing recommender algorithms, some effort was required to prepare the data to be supplied as input to the algorithms. In the rest of this subsection, we describe the main approaches employed for data pre-processing and transformation.

In order to simplify the classification, the rating attribute of the BookCrossing database was modified in the same way as for MovieLens: ‘‘Not recommended’’ (score from 1 to 6) and ‘‘Recommended’’ (score from 7 to 10). For books’ data, we used two attributes from the dataset: Publication Year and Author. The first was discretized into five ranges. The Author attribute was also modified, because it originally encompassed 48,234 distinct values. Thus, the dataset was reduced so that this attribute encompasses only 40 distinct values (the ones that appear in the most records). Regarding users’ data, we also used two attributes: Age and the Place where the user lives. The first was discretized into nine age ranges. The Place attribute originally contained the name of the city, the state or province, and the name of the country, so it presented 12,952 distinct values. Therefore, we changed this attribute so that it encompasses only 40 distinct values. To that end, and noticing that 75% of the places were from the USA, we divided the dataset, based on this attribute, in two: places grouped by states of the USA and places grouped by countries other than the USA. Afterwards, the first dataset (states of the USA) remained with 25,523 records and the second one (countries) remained with 8,926 records. In order to perform the case study on even more diverse data, we also used two additional datasets, having a smaller range of distinct values, derived from those mentioned before. To do so, we copied both datasets and kept only 10 distinct values (the most frequent) for the Author and Country/State attributes. This way, we obtained two more datasets that contain 6,270 records (for the dataset of states of the USA) and 3,238 records (for the dataset of countries).
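As an illustration, the following Python sketch shows one way such pre-processing could be reproduced with pandas. The column names (rating, year, author, age, country, state) and the exact bin edges are assumptions made for illustration; the paper only specifies the number of ranges and the 40 (or 10) most frequent values, not the concrete boundaries.

```python
import pandas as pd

def preprocess_bookcrossing(df: pd.DataFrame) -> dict:
    """Sketch of the BookCrossing pre-processing described above.
    Column names and bin edges are illustrative assumptions."""
    df = df.copy()

    # Binarize ratings: 1-6 -> "Not recommended", 7-10 -> "Recommended".
    df["rating_class"] = df["rating"].apply(
        lambda r: "Recommended" if r >= 7 else "Not recommended")

    # Discretize publication year into five ranges (edges are assumed).
    df["year_range"] = pd.cut(df["year"], bins=5)

    # Discretize age into nine ranges (edges are assumed).
    df["age_range"] = pd.cut(df["age"], bins=9)

    # Keep only the 40 most frequent authors.
    top_authors = df["author"].value_counts().head(40).index
    df = df[df["author"].isin(top_authors)]

    # Split by place: USA records grouped by state, the rest by country,
    # keeping at most 40 distinct place values in each resulting dataset.
    usa = df[df["country"] == "usa"].copy()
    rest = df[df["country"] != "usa"].copy()
    usa = usa[usa["state"].isin(usa["state"].value_counts().head(40).index)]
    rest = rest[rest["country"].isin(rest["country"].value_counts().head(40).index)]

    return {"usa_by_state": usa, "world_by_country": rest}
```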
5.2. Associative classifiers vs. general classification algorithms

In this subsection we describe experiments intended to test classification algorithms, especially associative classifiers, on real recommender systems data. To do so, we compared three associative classifiers (CBA, CPAR and CMAR) with two traditional classifiers (Bayes Net and C4.5), where Bayes Net is a probabilistic algorithm and C4.5 is a decision-tree based algorithm. The first two were run through WEKA and the other three were obtained from the LUCS-KDD Software Library, from the University of Liverpool. Bayes Net and C4.5 were chosen because, besides being classification methods widely employed in recommender systems, they represent two groups of machine learning methods: probabilistic classification (Bayes Net) and rule-based classification (C4.5). On the other hand, besides there being few accurate and effective associative classifiers, there are even fewer implementations available. In fact, the LUCS-KDD Software Library was the only software repository we found offering associative classifiers for free use. Therefore, we employed only three classification based on association algorithms in this case study. CBA was chosen because it was the first associative classifier built and its concepts are present in most current associative classifiers. Conversely, CMAR was chosen because it represents a more modern generation of these classifiers: it employs an alternative data structure for storing classification rules, which, according to Li et al. (2001), makes it possible to reduce processing time and, in some cases, to increase precision. Finally, CPAR was chosen because it is another popular associative classifier proposed more recently.

In order to perform the experiments, we defined a very low support threshold value (1%) for running the classification based on association algorithms, so as to obtain enough frequent itemsets. Conversely, we defined high confidence threshold values: 85% to apply CBA and CPAR and 70% to apply CMAR. We reduced the confidence threshold for CMAR because the data structure it employs, an FP-Tree (Frequent Pattern Tree), stores frequent itemsets in a compact way in which common relations between itemsets are exploited; items therefore need to be frequent enough to be stored in the FP-Tree and thus considered in the first place. For the BookCrossing datasets containing 10 distinct values for the Author and Country/State attributes, we increased the support threshold to 5% due to their reduced number of records.

In Table 1 we show the results obtained after running the algorithms mentioned above. Each line depicts the accuracy obtained by each classifier, which is defined as the percentage of correctly classified samples among the whole data taken into account. Results revealed that the associative classifiers reached accuracy similar to the traditional classifiers (supervised learning), except CMAR on the BookCrossing data. Actually, in some cases the associative classifiers reached higher accuracy. Despite being the first classification based on association method, the CBA algorithm reached the highest accuracy on two of the four BookCrossing datasets. On the MovieLens data, CMAR reached the highest accuracy, which was the best result obtained over all the experiments. Since the rules provided by the associative classifiers hold a high confidence value (equal to or greater than 70% or 85%), the rules used for building the classification models are reliable.
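To make the role of the support and confidence thresholds concrete, the following minimal Python sketch mines class association rules over attribute-value pairs with support of at least 1% and confidence of at least 85%, and classifies a record with the best matching rule. This is only a toy illustration of the generic mine-then-classify idea, not the LUCS-KDD implementations of CBA, CPAR or CMAR; all function and variable names are hypothetical.

```python
from itertools import combinations
from collections import Counter

def mine_class_rules(records, classes, min_sup=0.01, min_conf=0.85, max_len=2):
    """Toy class-association-rule miner (illustrative, not CBA/CPAR/CMAR).

    records: list of sets of (attribute, value) pairs; classes: parallel list
    of class labels. Returns rules as (itemset, class, support, confidence).
    """
    n = len(records)
    item_counts = Counter()   # support counts of candidate itemsets
    rule_counts = Counter()   # support counts of (itemset, class) pairs
    for rec, cls in zip(records, classes):
        for size in range(1, max_len + 1):
            for itemset in combinations(sorted(rec), size):
                item_counts[itemset] += 1
                rule_counts[(itemset, cls)] += 1

    rules = []
    for (itemset, cls), cnt in rule_counts.items():
        sup = cnt / n
        conf = cnt / item_counts[itemset]
        if sup >= min_sup and conf >= min_conf:
            rules.append((itemset, cls, sup, conf))
    # Prefer high-confidence, then high-support rules when classifying.
    rules.sort(key=lambda r: (r[3], r[2]), reverse=True)
    return rules

def classify(rules, record, default="Not recommended"):
    """Apply the first (best) rule whose antecedent is contained in the record."""
    for itemset, cls, _sup, _conf in rules:
        if set(itemset) <= record:
            return cls
    return default
```

The sketch only captures the common idea of selecting high-support, high-confidence rules; the three classifiers differ in how rules are generated, pruned and combined (CMAR, for instance, mines rules with an FP-Tree and combines several matching rules using a weighted chi-square measure, while CPAR generates rules greedily in a FOIL-like manner).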
The ninth rule generated by CMAR on the MovieLens data is an example of this kind of rule: ‘‘age = (25–34) & genre = ‘drama’ → rating = ‘yes’’’. Such a rule states that, if a user is older than 25 years and younger than 34 years old, he will probably rate a drama movie positively; a minimal sketch of how such a rule is applied is given after Table 1. Despite presenting the highest accuracy over all experiments (85.16% on the MovieLens data), CMAR presented lower accuracy than the other classifiers on the BookCrossing datasets. MovieLens and BookCrossing data basically differ in the number of distinct values of their attributes. MovieLens has only two distinct values on the Genre attribute, for example, and its other attributes have, in general, fewer distinct values than those of the BookCrossing datasets when the ratio of records to number of distinct values is taken into consideration. Moreover, MovieLens has a ratio of 59.45 ratings per item, which is considerably greater than the ratio of the BookCrossing data (2.33 ratings per item). Taking the BookCrossing datasets into account, when comparing the datasets of ‘‘world countries’’ with the datasets of …

Table 1
Comparison of classifiers.

Data                 Bayes Net (%)   C4.5 (%)   CBA (%)   CPAR (%)   CMAR (%)
MovieLens            81.95           82.88      81.4      74.07      85.16
BCrossing World      80.87           80.21      79.47     73.25      70.43
BCrossing World 10   80.51           79.98      80.52     79.86      41.41
BCrossing USA        80.23           81.31      80.28     78.15      77.66
BCrossing USA 10     81.53           80.82      81.56     76.71      69.59
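To make the rule's semantics and the accuracy measure concrete, the following minimal Python sketch applies the quoted CMAR rule to a few records and computes accuracy exactly as defined above (correctly classified samples over all samples). The records and field names are made-up illustrations, not data from the experiments.

```python
# Hypothetical illustration of the example rule and of the accuracy measure;
# the records below are invented, not taken from the MovieLens experiments.
def rule_matches(record):
    """Antecedent of the example rule: age between 25 and 34, genre 'drama'."""
    return 25 < record["age"] < 34 and record["genre"] == "drama"

def predict(record):
    # Consequent of the rule; records not covered by it default to "no".
    return "yes" if rule_matches(record) else "no"

examples = [  # (record, true rating class) -- illustrative values only
    ({"age": 29, "genre": "drama"}, "yes"),
    ({"age": 41, "genre": "drama"}, "no"),
    ({"age": 30, "genre": "comedy"}, "yes"),
]

correct = sum(predict(rec) == truth for rec, truth in examples)
accuracy = 100.0 * correct / len(examples)  # % of correctly classified samples
print(f"accuracy = {accuracy:.2f}%")
```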