Expert Systems with Applications 38 (2011) 15344–15355
Contents lists available at ScienceDirect
Expert Systems with Applications
journal homepage: www.elsevier.com/locate/eswa

An implementation and evaluation of recommender systems for traveling abroad

Dong-Her Shih a, David C. Yen b,⇑, Ho-Cheng Lin a, Ming-Hung Shih c

a Department of Information Management, National Yunlin University of Science and Technology, 123, Section 3, University Road, Douliu, Yunlin, Taiwan, ROC
b Department of DSC & MIS, Farmer School of Business, Miami University, Oxford, OH 45056, USA
c Department of Electrical and Computer Engineering, NC State University, Raleigh, NC 27695, USA

⇑ Corresponding author. Tel.: +1 513 529 4827; fax: +1 513 529 9689.
E-mail addresses: shihdh@yuntech.edu.tw (D.-H. Shih), yendc@muohio.edu (D.C. Yen), g9423719@yuntech.edu.tw (H.-C. Lin), dannysmh@gmail.com (M.-H. Shih).

0957-4174/$ - see front matter © 2011 Elsevier Ltd. All rights reserved. doi:10.1016/j.eswa.2011.06.030

article info

Keywords: Content-based filtering; Collaborative filtering; Hybrid filtering; Taxonomy; Recommender system; Bayes’ theorem

abstract

Improvements in information technology mean that storage is no longer a problem. In addition, the birth of the Internet makes information transfer faster than ever, which brings convenience to our lives. However, the ever-growing volume of information results in a new problem: information overload. Today, many more people are traveling abroad since they no longer have to work on weekends, and traveling abroad has become a trend. There are more than a hundred countries in the world worth visiting, and so much information is available that a traveler’s decision becomes extremely difficult to make. In our research, we implement the three most common kinds of recommender system techniques in order to recommend to customers which countries are the best traveling locations for them. Thus, we can save travelers a lot of time when deciding where to go. From our experiment and evaluation, we find that a hybrid recommender system is the better recommendation technique for our abroad database, and that it overcomes the shortcomings of the content-based filtering and collaborative filtering approaches.

© 2011 Elsevier Ltd. All rights reserved.

1. Introduction

Due to the explosion of e-commerce in recent years, the rapid spread of the Internet has made our world move faster than ever. For firms, this makes it easy to develop a one-to-one marketing business style. One important issue is that companies should establish a relationship between themselves and their customers, and provide appropriate information and products that match the interests of those customers. The need for new marketing strategies in e-commerce, such as one-to-one marketing, Web personalization, and customer relationship management, has been stressed both in research and in practice (Mobasher, Cooley, & Srivastava, 2000; Sarwar, Karypis, Konstan, & Reidl, 2000).

It is important to interact with customers and provide them with personalized service and communication. Such customer interactions can transform customer information into quality services or products (Weng & Liu, 2004). For customer relationship management, one-to-one marketing is one of the most effective approaches to enhance customer satisfaction, loyalty, and reputation.

Because of the rapid spread of the Internet, information overload has become a serious problem. One way to overcome this problem is to develop an intelligent recommender system that provides personalized information services (Schafer, Konstan, & Riedl, 2001): such a system retrieves the information consumers desire and helps them determine which product to buy. A recommender system is an information filtering system that applies data analysis techniques to the problem of helping customers find the products they would like to purchase, by producing a predicted likeliness score or a list of recommended products for a given customer (Sarwar et al., 1998). Recommender systems have been used on many Websites to recommend various items including movies, music, news, articles, books, software, computers, etc. (see Fig. 1). There are three approaches to building recommender systems: content-based filtering (CBF), collaborative filtering (CF) and hybrid filtering.

One advantage of a personalized recommender system is that consumers can immediately access the information they are interested in, and save time by not having to read through excess information. On the other hand, enterprises can collect customers’ buying behaviors and then develop appropriate marketing strategies to attract different customers and efficiently deliver the information they are interested in. Customer satisfaction and loyalty will thus increase, and the increase in customers’ visiting frequency can further create more transaction opportunities and benefit Internet enterprises.

Many more people are traveling abroad since they no longer have to work during the weekends, which has led to a rapid increase in the traveling population. The importance of leisure time is increasing, and there is a tendency toward traveling
abroad. According to the Tourism Bureau of Taiwan, there are 8.2 million people traveling abroad. Many more people are traveling abroad, as shown in Fig. 2 (Source: taiwan.net.tw).

Fig. 1. www.amazon.com.

Fig. 2. Statistics on nationals traveling abroad between 1992 and 2009.

In this paper, we have implemented three approaches for building recommender systems (content-based recommending, collaborative filtering and hybrid filtering) to recommend countries to travel to. In our experiment we use real data to evaluate these three approaches and determine which one is better. This paper has three primary research contributions:

1. Develop a recommender technique for on-line traveling e-business.
2. Present a hybrid method, a collaborative filtering method, and a content-based method, and discuss their advantages and disadvantages.
3. Evaluate the effect of different variables in these three methods.

The remainder of the paper is organized as follows. In Section 2, related work is discussed, including personalization and recommender systems. The elementary theoretical background is provided in Section 3, followed by Section 4 explaining the experiment and results. Finally, the conclusion is given in Section 5.

2. Related work

Recommender systems apply data analysis techniques to the problem of helping users find the items they would like to purchase at e-commerce sites by producing a predicted likeliness score or a list of top-N recommended items for a given user. A recommendation system can provide personalized information services in different ways, depending on whether the system has been recording and analyzing a customer’s previous preferences. Hence, there are three general types of recommender systems: the content-based approach, the collaborative filtering approach and the hybrid filtering approach. Among them, collaborative filtering is the most popular personalized recommendation method and is widely used in recommender systems.

2.1. Personalization

The term ‘personalization’ is often used in the context of recommender systems that selectively promote products to end-users based on the analysis of earlier interactions (Schafer, Konstan, & Riedl, 1999). Personalization means that a Website can cater to a customer’s unique and particular needs. Mobasher et al. (2000), Mobasher, Dai, and Luo (2002), and Mobasher, Dai, Luo, and Nakagawa (2001) defined Web personalization as an act of response according to the individual user’s interests and hobbies in Internet usage. Through personalization, businesses can predict a customer’s behavior from past purchasing records and demographic data. Accordingly, companies can develop more appropriate marketing strategies to fit each customer by providing suitable information and products/services. Customer satisfaction and loyalty can thus be enhanced, and the increase in each customer’s visiting frequency can further create more transaction opportunities and benefit Internet businesses (Lee, Liu, & Lu, 2002).

2.2. Recommender systems

A recommender system is defined as a system which recommends an appropriate product or service to certain customers according to their needs. Today, more and more researchers are studying recommender systems. The most important factor in a recommender system is how to analyze customer behavior, so that the system can recommend products based on an accurate estimation approach (Sarwar et al., 2000, 2001). Key measures and techniques used in recommender systems include cosine similarity, Pearson correlation, the naive Bayes (NB) classifier, and Euclidean distance.
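The three vector measures named above (cosine similarity, Pearson correlation and Euclidean distance) can be illustrated with a minimal Python sketch. The function names and the convention of returning 0.0 for degenerate inputs are assumptions made for illustration; the paper does not specify an implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumed convention: no similarity for an all-zero vector
    return dot / (norm_a * norm_b)


def pearson_correlation(a, b):
    """Pearson correlation coefficient of two equal-length rating vectors."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # assumed convention: constant vectors carry no signal
    return cov / math.sqrt(var_a * var_b)


def euclidean_distance(a, b):
    """Straight-line distance; smaller values mean more similar vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

Note that cosine similarity and Pearson correlation grow with similarity, whereas Euclidean distance shrinks, so a system that uses the latter ranks neighbors in ascending order.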
Recommender systems are often used in e-commerce: Websites suggest products or services to their customers and provide information suited to each user. A growing number of e-commerce businesses are adopting recommender system technologies in their Websites. The three most famous Websites are Amazon.com, eBay, and google.com.

Most recommendation techniques fall into two categories, namely content-based filtering and collaborative filtering (special issue on information filtering). Recently, the hybrid approach has become a significant recommendation technique. Therefore, three major approaches are used for processing input data and formulating the prediction: collaborative filtering (CF), content-based filtering (CBF) and the hybrid filtering approach.

2.2.1. Content-based filtering

Content-based filtering makes predictions by analyzing a user’s previous preferences or interests, which are the obvious indicators of the user’s future behavior. CBF requires that items are described by features, and is typically applied to text-based documents or in domains with structured data (Khribi, Jemni, & Nasraoui, 2009; Pazzani, 1999). Next, the relevance of a given content item to the user’s interest profile is measured by the similarity of the item to that profile. Finally, items that have a high degree of similarity to the user’s interest profile are recommended to the user. For example, content-based filtering has been utilized in book recommendation tasks (Mooney & Roy, 2000), using features such as title, author, or theme. In such cases, the user’s previous preferences on the respective features are used to filter the available books and recommend the most relevant ones to the user. Content-based filtering is typically applied to recommend products that have analyzable content or descriptions, such as books (Mooney & Roy, 2000).
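The content-based steps described above (describe items by features, build an interest profile from past preferences, rank the remaining items by their similarity to the profile) can be sketched as follows. This is a minimal sketch under stated assumptions: the feature encoding, the overlap-based score and the sample books are all hypothetical and are not taken from the paper or from Mooney and Roy (2000).

```python
from collections import Counter

# Hypothetical catalogue: each book is described by a set of features.
ITEM_FEATURES = {
    "book_a": {"author:smith", "theme:travel"},
    "book_b": {"author:smith", "theme:cooking"},
    "book_c": {"author:jones", "theme:travel"},
    "book_d": {"author:lee", "theme:history"},
}


def build_profile(liked_items, item_features):
    """Count how often each feature occurs among the items the user liked."""
    profile = Counter()
    for item in liked_items:
        profile.update(item_features[item])
    return profile


def profile_score(profile, features):
    """Share of the profile's total weight covered by an item's features."""
    total = sum(profile.values())
    if total == 0:
        return 0.0
    return sum(profile[f] for f in features) / total


def recommend(liked_items, item_features, top_n=2):
    """Rank items the user has not seen by similarity to the profile."""
    profile = build_profile(liked_items, item_features)
    candidates = [i for i in item_features if i not in liked_items]
    # Sort by descending score; ties are broken by item name for determinism.
    ranked = sorted(
        candidates,
        key=lambda i: (-profile_score(profile, item_features[i]), i),
    )
    return ranked[:top_n]
```

For a user who liked only `book_a`, `recommend({"book_a"}, ITEM_FEATURES)` ranks `book_b` (same author) and `book_c` (same theme) ahead of `book_d`, which shares no feature with the profile. It also shows the limitation of this family of methods: nothing outside the user's known features can surface.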
A customer’s personal information is first collected, and then the system infers the customer’s preferences by analyzing and modeling the available personal information. Once the consumer’s personal information is obtained, the recommender system can construct a computational model to predict the user’s preference for other items in the same application domain. In fact, the work of recommendation can be regarded as classification: using already known information to set up a model that predicts unknown events (Lee et al., 2002).

2.2.2. Collaborative filtering

Collaborative filtering is a method for calculating the expected user preference for a product, using evaluations by, or the preferences of, other users who have experienced the product (Billsus & Pazzani, 1998; Goldberg, Nichols, Oki, & Terry, 1992; Konstan et al., 1997). CF is designed for less frequently purchased products. It is currently widely applied to various products such as music or movies (Billsus & Pazzani, 1998; Goldberg et al., 1992; Konstan et al., 1997). The basic input data consist of the preference matrix between users and products; to collect user preferences for this input data, explicit ratings, a purchasing intention, or an implicit preference such as an inquiry or visit may be used. Similarity among users is calculated by the Pearson correlation coefficient or the cosine measure (Konstan et al., 1997; Mild & Natter, 2002; Sarwar et al., 2000). Based on the similarity calculation and similar features, we can find the neighbors of a particular user. We can then calculate a user’s preference for a product based on his or her average preference for other products and his or her neighbors’ preferences for the product (Khribi et al., 2009; Konstan et al., 1997; Mild & Natter, 2002; Sarwar et al., 2000).
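The prediction step just described (the user's own average preference, adjusted by how the neighbors rated the product) can be sketched as below. This is a minimal sketch, assuming the commonly used mean-centered weighted average with Pearson weights; the data layout, the choice to keep only positively correlated neighbors, and the parameter `k` are assumptions for illustration rather than details given in the paper.

```python
import math


def mean_rating(ratings, user):
    """Average of all ratings given by one user."""
    values = ratings[user].values()
    return sum(values) / len(values)


def pearson(ratings, a, b):
    """Pearson correlation of two users over the items both have rated."""
    common = set(ratings[a]) & set(ratings[b])
    if len(common) < 2:
        return 0.0  # assumed convention: too little overlap to compare
    mean_a = sum(ratings[a][i] for i in common) / len(common)
    mean_b = sum(ratings[b][i] for i in common) / len(common)
    cov = sum((ratings[a][i] - mean_a) * (ratings[b][i] - mean_b) for i in common)
    var_a = sum((ratings[a][i] - mean_a) ** 2 for i in common)
    var_b = sum((ratings[b][i] - mean_b) ** 2 for i in common)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def predict(ratings, active, item, k=2):
    """Predicted rating: active user's mean plus weighted neighbor deviations.

    prediction = mean(active) + sum(w_u * (r_u,item - mean(u))) / sum(|w_u|)
    where u ranges over the k most similar users who rated the item.
    """
    neighbors = [
        (pearson(ratings, active, u), u)
        for u in ratings
        if u != active and item in ratings[u]
    ]
    # Assumption: keep only positively correlated users, most similar first.
    neighbors = sorted((n for n in neighbors if n[0] > 0), reverse=True)[:k]
    base = mean_rating(ratings, active)
    weight_sum = sum(abs(w) for w, _ in neighbors)
    if weight_sum == 0:
        return base  # no usable neighbor: fall back to the user's own mean
    deviation = sum(
        w * (ratings[u][item] - mean_rating(ratings, u)) for w, u in neighbors
    )
    return base + deviation / weight_sum
```

The fallback branch makes the sparsity issue concrete: when no neighbor has rated the product, the prediction carries no more information than the user's own average.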
In collaborative filtering, the nearest neighbor algorithm requires computation that grows with both the number of customers and the number of products, and it also has a sparsity problem: if there are few user preferences, its recommendation performance is low (Sarwar et al., 2000).

Collaborative filtering based on the user (CF-U) (Resnick, Iacovou, Suchak, Bergstrom, & Riedl, 1994; Sarwar et al., 2000; Shardanand & Maes, 1995) is the most successful recommending technique to date, and is extensively used in many commercial recommender systems (Liu, Lai, & Lee 2009; Shih, Chiang, & Lin, 2008). Recommender systems based on CF-U compute the top-N recommended items for a given user as follows. First, they identify the k most similar users in the database. This is often done by modeling users and items with the vector-space model, which is widely used for information retrieval (Sarwar et al., 2000). In this model, each of the n users, as well as the active user, is treated as a vector in the m-dimensional item space, and the similarity of the active user to existing users is measured by computing the cosine between these vectors or their correlation.

To address the scalability concerns of CF-U algorithms and to provide better explanations of recommendations to users, collaborative filtering based on item (CF-I) techniques have been developed (Billsus & Pazzani, 1998; Sarwar et al., 2000). These approaches analyze the user-item matrix to identify relations between the different items, and then use these relations to compute the list of top-N recommendations.

2.2.3. Hybrid

Hybrid recommender systems combine two or more recommendation techniques to gain better performance with fewer of the drawbacks of any individual one (Liu et al., 2009). Most commonly, collaborative filtering is combined with some other technique in an attempt to avoid the ramp-up problem. Balabanović and Shoham (1997) apply a ‘‘Selection agent’’, which decides the recommendation algorithm between content-based filtering and CF.
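One simple way to picture a hybrid that chooses between the two techniques is a switching rule. The sketch below is only an illustration under stated assumptions: the rule (use CF once a user has at least `min_ratings` ratings, otherwise fall back to CBF) and all names are hypothetical, and it is not claimed to be the selection mechanism of Balabanović and Shoham (1997) or the hybrid method of this paper.

```python
def hybrid_recommend(user, ratings, cf_recommend, cbf_recommend, min_ratings=5):
    """Switch between two recommenders based on how much is known about a user.

    ratings:        dict mapping user -> {item: rating}
    cf_recommend:   callable(user) -> list of items (collaborative filtering)
    cbf_recommend:  callable(user) -> list of items (content-based filtering)
    min_ratings:    hypothetical threshold below which CF is considered
                    unreliable because of the ramp-up (cold-start) problem
    """
    known = len(ratings.get(user, {}))
    if known >= min_ratings:
        return cf_recommend(user)
    return cbf_recommend(user)
```

The two recommenders are passed in as callables so that the switching logic stays independent of how each technique is implemented.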
Pazzani (1999) shows that a hybrid approach to recommendation uses more of the available information and consequently produces more precise recommendations. The strengths of the different approaches can be complementary.

The domains in which these approaches have been applied are shown in Table 1.

Table 1
Classification of recommender system.

Method                    Domain       Authors
Content based filtering   Products     Lawrence, Almasi, Kotlyar, Viveros, and Duri (2001)
                          e-Commerce   Lee et al. (2002)
                          e-Learning   Khribi et al. (2009)
Collaborative filtering   Movies       Resnick et al. (1994) and Kim and Yum (2005)
                          Products     Shardanand and Maes (1995), Linden, Smith, and York (2003), Cho and Kim (2005), Choi, Kang, and Jeon (2006) and Liu et al. (2009)
                          Music        Hayes and Cunningham (2004)
                          Wallpaper    Kim, Lee, Cho, and Kim (2004)
                          Software     Akinaga et al. (2005)
                          News         Lee and Park (2007)
                          Music        Li et al. (2007)
                          Spam         Shih et al. (2008)
                          e-Learning   Khribi et al. (2009)
Hybrid                    Movies       Schein, Popescul, Ungar, and Pennock (2002)
                          Products     Liu and Shih (2005) and Liu et al. (2009)

3. Methodology

In this paper, we have implemented three generalized recommending techniques for recommending to customers which countries are the best traveling locations for them. Thus, travelers can save a lot of time by removing hesitation and being able to make a quicker, more efficient decision. In every recommending
technique, we have designed a recommendation process that follows the corresponding filtering technique and is divided into two phases. Phase one is the learning phase, and phase two is the recommending test phase. The detailed recommending methodology is described as follows.

3.1. Content-based filtering

Content-based filtering refers to analyzing log data, according to the attributes of items and user performance, to provide users with recommendation results (Li, Smith, Bergman, & Castelli, 1998). Content-based filtering is the earliest recommendation method. Unfortunately, this method can only recommend items that are related to history data. We designed a recommendation process that follows content-based filtering and divided it into two phases. Phase one is the learning phase and phase two is the recommending test phase. The detailed structure is shown in Fig. 3.

3.1.1. Phase I

First, we pre-process the raw data. The data are divided into two parts: learning data and recommending test data. According to the differences between the traveling locations, we adopt a taxonomy to classify the traveling locations into five continents (America, Oceania, Europe, Asia, and Africa). The next step is analyzing the relation between the traveling locations. In addition, we adopted the decision tree algorithm C5.0 to classify the learning data based on customer performances. The basic input data consist of gender, age, constellations, and selling place, and the output data consist of the traveling locations. Finally, this generates a decision model.

Most Web retailers have a product taxonomy available. In practice, a product taxonomy is represented as a tree that classifies a set of products at a low level into a more general product at a higher level. The leaves of the tree denote the product instances, and non-leaf nodes denote product classes obtained by combining several nodes at a lower level into one parent node.
The root node, labeled "All", denotes the most general product class. Fig. 4 shows an example of a product taxonomy for a fashion Web retailer.

Applications of product taxonomy in data mining have been emphasized by many researchers. We therefore propose a traveling location hierarchy for content-based filtering (see Fig. 5).

A decision tree is a tree in which each non-leaf node denotes a test on an attribute of cases, each branch corresponds to an outcome of the test, and each leaf node denotes a class prediction. The quality of a decision tree depends on both the classification accuracy and the size of the tree. Well-known decision tree induction algorithms include CHAID (Kass, 1980), CART (Breiman et al., 1984), C4.5 (Quinlan, 1993) and QUEST (Loh & Shih, 1997). Applications of decision tree based classification include target marketing, churn prediction, medical diagnosis and so on. A commercial version of C5.0 in the data mining package Clementine 7.0 is used in our study.

3.1.2. Phase II

We use the recommending test data as input, and the performance of customers as output. The data are then processed by the decision model. Through the decision tree algorithm, the system generates an output for each record, namely the fitting continents for the user. Combining this result with marketing policies yields the recommendation result. Here we apply the policy of recommending the top 2 traveling locations in every continent.

3.2. Collaborative filtering

Cho and Kim (2005) state that collaborative filtering is one of the most successful recommending methods. The method also fits various data sources such as movies, Websites, products, and software. We designed a recommendation process that follows collaborative filtering and divided it into two phases. Phase one is the learning phase and phase two is the recommending test phase. The detailed structure is shown in Fig. 6.
3.2.1. Phase I

First, the data are divided into two parts: learning data and recommending test data. We adopt the k-means algorithm to cluster the preprocessed data according to the attributes gender, age, constellations, selling place and locus of going abroad (as shown in Fig. 10). The basic input data consist of gender, age, constellations, selling place, and locus of going abroad, and the output data are the traveling locations (Fig. 7).

Fig. 3. Our proposed structure of content-based filtering.

Fig. 4. An example of product taxonomy.
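The clustering step of Phase I can be illustrated with a minimal sketch of plain k-means (Lloyd's algorithm). This is not the authors' implementation: the numeric encoding of the attributes, the toy records, the choice of two clusters, and the function name `kmeans` are all assumptions made for illustration only.

```python
from math import dist

def kmeans(points, centroids, max_iter=100):
    """Plain Lloyd's k-means on equal-length numeric tuples (illustrative only)."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each record joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda c: dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(old)  # leave an empty cluster's centroid as is
        if new_centroids == centroids:  # converged: nothing moved
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical toy records: (gender code, age / 100). Not taken from the data set.
records = [(0, 0.25), (0, 0.30), (1, 0.55), (1, 0.60)]
labels, centers = kmeans(records, centroids=[records[0], records[2]])
print(labels, centers)
```

Passing the initial centroids explicitly keeps the sketch deterministic; a production system would normally use several random restarts and a considered encoding for categorical attributes such as constellation and selling place.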
According to the differences between loci, we replace each traveling location with a number, as shown in Table 2. For example, Hong Kong and Macao is coded as 1. Therefore, a locus of China → Vietnam → Japan is encoded as (13, 19, 8). Next, we calculate the support value of the attributes, as shown in Table 3, and sort all the values. Then, by looking into the ROC (Edwards & Barron, 1994) weight table, we obtain a weight for every attribute. The order is gender, age, constellations, selling place, locus of going abroad.

Barron and Barrett's development of a formally justifiable solution to the task of turning rankings of weights into weights, and even more their demonstration of the quality of the result, motivated Edwards and Barron (1994) to define SMARTER. They call their weights Rank Order Centroid, or ROC, weights. The notation here is identical to theirs except that they call the number of attributes n, while we call it K.

The key ideas of the Barron–Barrett derivation are quite simple. If nothing were known about the weights except their sum, set at 1 by convention, then the set of possible non-negative weight vectors would be any that have that sum. If there were no prior reason to prefer one weight vector to another, it would be natural (and error-minimizing) to use equal weights. The point describing equal weights in the hyper-surface (simplex) of all possible weights is its centroid. Knowing the rank order of the weights changes the geometric description of the set of acceptable weights: it becomes a smaller simplex.
It is straightforward to specify the corner points of the smaller simplex consistent with knowing the ranks, and from there to specify its centroid. Moreover, the equations for the weights have a convenient computational form. If $W_1 \ge W_2 \ge \cdots \ge W_K$, then

$$
\begin{aligned}
W_1 &= (1 + 1/2 + 1/3 + \cdots + 1/K)/K,\\
W_2 &= (0 + 1/2 + 1/3 + \cdots + 1/K)/K,\\
W_3 &= (0 + 0 + 1/3 + \cdots + 1/K)/K,\\
&\ \ \vdots\\
W_K &= (0 + \cdots + 0 + 1/K)/K.
\end{aligned}
$$

More generally, if K is the number of attributes, then the weight of the kth attribute is

$$W_k = \frac{1}{K}\sum_{i=k}^{K}\frac{1}{i}. \tag{1}$$

Table 4 contains weights calculated from Eq. (1) for values of K from 2 to 11. Partial rank order information (e.g. tied ranks, missing ranks) can be handled, though the computational formulas are less pretty. Barron and Barrett treat such cases, drawing their methods from Kmietowicz and Pearman (1984).

3.2.2. Phase II

According to the result of the clustering, we can calculate the similarity between a user and the other users in the same cluster.

Fig. 5. Traveling location taxonomy.
All
  America: America (9); Canada (17)
  Oceania: the Pacific Ocean island (3); New Zealand & Australia (5)
  Europe: East Europe (2); Spain, Portugal & Morocco (15); North Europe (16); Mid-west Europe (18)
  Asia: Hong Kong & Macao (1); Malaysia & Singapore (6); Indonesia (7); Philippines, Cambodia & Vietnam (19); Thailand (11); South Asia (12); China (13); the Middle East (14); Japan (8)
  Africa: Africa (10)

Fig. 6. Our proposed structure of collaborative filtering.

Fig. 7. An example of abroad locus.

Table 2
Traveling location of vector.

| Code | Traveling location |
| 1 | Hong Kong and Macao |
| 2 | East Europe |
| 3 | Island in the Pacific Ocean |
| 4 | Korea |
| 5 | New Zealand and Australia |
| 6 | Malaysia and Singapore |
| 7 | Indonesia |
| 8 | Japan |
| 9 | America |
| 10 | Africa |
| 11 | Thailand |
| 12 | South Asia |
| 13 | China |
| 14 | the Middle East |
| 15 | Spain, Portugal and Morocco |
| 16 | North Europe |
| 17 | Canada |
| 18 | Midwest Europe |
| 19 | Philippines, Cambodia and Vietnam |

Table 3
The support value of gender.

| Gender | Total | Support (%) |
| Male | 13,750 | 37.5 |
| Female | 22,888 | 62.5 |
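As a numerical check on the Rank Order Centroid weights of Eq. (1), the following minimal sketch computes them for the five ranked attributes (gender, age, constellations, selling place, locus of going abroad). The function name `roc_weights` is hypothetical; the formula is the one from Edwards and Barron (1994), and the rounded values can be compared with the K = 5 column of Table 4.

```python
def roc_weights(K):
    """Rank Order Centroid weights for K ranked attributes (rank 1 = most important).

    The weight of rank k is (1/K) * sum_{i=k}^{K} 1/i.
    """
    return [sum(1.0 / i for i in range(k, K + 1)) / K for k in range(1, K + 1)]

# Five attributes, as in the ordering gender, age, constellations,
# selling place, locus of going abroad.
weights = roc_weights(5)
print([round(w, 4) for w in weights])  # compare with the K = 5 column of Table 4
print(round(sum(weights), 10))         # the weights sum to 1 by construction
```

Because the weights depend only on the rank order, any change in the support-based ordering of the attributes changes which attribute receives which weight, but not the weight values themselves.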
Here cosine similarity (Margaret, 2003) is adopted; it is a common method in data mining. With it we can find the similarity between two traveling locations, as in Eqs. (2) and (3). If $t_i$ and $t_j$ are loci of going abroad, $\mathrm{sim}(t_i, t_j)$ denotes the similarity between the two loci and returns a value in [0, 1].

$$t_i = [t_{i1}, t_{i2}, \ldots, t_{ik}], \quad k = 1, 2, \ldots, n$$

$$t_j = [t_{j1}, t_{j2}, \ldots, t_{jk}], \quad k = 1, 2, \ldots, n$$

$$\mathrm{sim}(t_i, t_j) = \frac{\sum_{h=1}^{k} t_{ih}\, t_{jh}}{\sqrt{\sum_{h=1}^{k} t_{ih}^{2}\, \sum_{h=1}^{k} t_{jh}^{2}}}, \tag{2}$$

$$\mathrm{sim}(t_i, t_j) = \frac{\sum_{h=1}^{k} w(t_{ih})\, w(t_{jh})}{\sqrt{\sum_{h=1}^{k} w(t_{ih})^{2}\, \sum_{h=1}^{k} w(t_{jh})^{2}}}, \quad \text{if weighted.} \tag{3}$$

3.3. Hybrid approach

Several recommendation systems use a hybrid approach that combines collaborative and content-based methods, which helps to avoid certain limitations of content-based and collaborative systems. Different ways to combine collaborative and content-based methods into a hybrid recommender system can be classified as follows (Adomavicius & Tuzhilin, 2005):

1. Implementing collaborative and content-based methods separately and combining their predictions.
2. Incorporating some content-based characteristics into a collaborative approach.
3. Incorporating some collaborative characteristics into a content-based approach.
4. Constructing a general unifying model that incorporates both content-based and collaborative characteristics.

In this paper, we adopt the first type: implementing collaborative and content-based methods separately and combining their predictions. Furthermore, we designed a recommendation process that follows hybrid filtering, as shown in Fig. 8. Every record is processed by content-based filtering and by collaborative filtering, and each filtering returns a result.
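Returning briefly to the similarity measure of Eqs. (2) and (3) in Section 3.2.2, the following minimal sketch shows both the plain and the weighted form. It assumes that the weighting function w(·) multiplies the h-th component by a per-attribute weight (for example a ROC weight); that reading of Eq. (3), the function name `cosine_similarity`, and the toy vectors are assumptions rather than details confirmed by the paper.

```python
from math import sqrt

def cosine_similarity(a, b, weights=None):
    """Cosine similarity of two equal-length numeric vectors.

    With `weights`, each component is scaled by its attribute weight first
    (an assumed reading of the weighted form in Eq. (3)).
    """
    if weights is None:
        weights = [1.0] * len(a)
    wa = [w * x for w, x in zip(weights, a)]
    wb = [w * x for w, x in zip(weights, b)]
    dot = sum(x * y for x, y in zip(wa, wb))
    norm = sqrt(sum(x * x for x in wa) * sum(y * y for y in wb))
    return dot / norm if norm else 0.0  # define similarity as 0 for a zero vector

# Hypothetical non-negative feature vectors for two customers.
u = [1, 0, 1]
v = [1, 1, 0]
print(cosine_similarity(u, v))
print(cosine_similarity(u, v, weights=[2, 1, 1]))
```

For non-negative components the result lies in [0, 1], which matches the range stated for sim in the text; with signed components the range would be [-1, 1].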
The problem is how to choose between the two results. In this paper we adopt Bayes' theorem to calculate which one has the higher posterior probability; the result with the higher probability is the final result of the hybrid method.

Bayes' theorem is a result in probability theory that relates the conditional and marginal probability distributions of random variables. In some interpretations of probability, Bayes' theorem tells how to update or revise beliefs a posteriori in light of new evidence. The probability of an event A conditional on another event B is generally different from the probability of B conditional on A. However, there is a definite relationship between the two, and Bayes' theorem is the statement of that relationship.

As a formal theorem, Bayes' theorem is valid in all interpretations of probability. However, frequentist and Bayesian interpretations disagree about the kinds of things to which probabilities should be assigned in applications: frequentists assign probabilities to random events according to their frequencies of occurrence or to subsets of populations as proportions of the whole; Bayesians assign probabilities to propositions that are uncertain. A consequence is that Bayesians have more frequent occasion to use Bayes' theorem. The articles on Bayesian probability and frequentist probability discuss these debates at greater length.

Bayes' theorem relates the conditional and marginal probabilities of stochastic events A and B:

$$
P(A_i \mid B) = \frac{P(B \cap A_i)}{P(B)}
= \frac{P(B \cap A_i)}{P(B \cap A_1) + P(B \cap A_2) + \cdots + P(B \cap A_k)}
= \frac{P(A_i)\,P(B \mid A_i)}{P(A_1)\,P(B \mid A_1) + P(A_2)\,P(B \mid A_2) + \cdots + P(A_k)\,P(B \mid A_k)}
\propto L(A \mid B)\,P(A),
$$

where L(A|B) is the likelihood of A given fixed B. Although in this case P(B|A) = L(A|B), in other cases the likelihood L can be multiplied by a constant factor, so that it is proportional to, but does not equal, the probability P.

For example, suppose 1000 people have been to Korea and Japan, which is about 1/10 of the whole database.
This means that the prior probability P(Japan and Korea) = 0.1. Among these people, 400 have also been to China, 300 to Hong Kong and Macao, 200 to Thailand, and 100 to Indonesia. If our target user has been to Korea and Japan, which traveling location should we recommend? According to the conditional probabilities in Table 5, if the result of CF is China together with Hong Kong and Macao, then P(China|Japan and Korea) = 0.4 and P(Hong Kong and Macao|Japan and Korea) = 0.3. Therefore the conditional probability of the CF result is P(China|Japan and Korea) + P(Hong Kong and Macao|Japan and Korea) = 0.4 + 0.3 = 0.7. In the same way, the conditional probability of the CBF result is P(Thailand|Japan and Korea) + P(Indonesia|Japan and Korea) = 0.2 + 0.1 = 0.3. Since 0.7 > 0.3, the result of CF is the final result of the hybrid method. In the opposite case, the result of CBF would be the final result of the hybrid method.

Table 4
ROC weights for indicated number of attributes.

| Rank | K = 2 | K = 3 | K = 4 | K = 5 | K = 6 | K = 7 | K = 8 | K = 9 | K = 10 | K = 11 |
| 1 | 0.7500 | 0.6111 | 0.5208 | 0.4567 | 0.4083 | 0.3704 | 0.3397 | 0.3143 | 0.2929 | 0.2745 |
| 2 | 0.2500 | 0.2778 | 0.2708 | 0.2567 | 0.2417 | 0.2276 | 0.2147 | 0.2032 | 0.1929 | 0.1836 |
| 3 | | 0.1111 | 0.1458 | 0.1567 | 0.1583 | 0.1561 | 0.1522 | 0.1477 | 0.1429 | 0.1382 |
| 4 | | | 0.0625 | 0.0900 | 0.1028 | 0.1085 | 0.1106 | 0.1106 | 0.1096 | 0.1079 |
| 5 | | | | 0.0400 | 0.0611 | 0.0728 | 0.0793 | 0.0828 | 0.0846 | 0.0851 |
| 6 | | | | | 0.0278 | 0.0442 | 0.0543 | 0.0606 | 0.0646 | 0.0670 |
| 7 | | | | | | 0.0204 | 0.0335 | 0.0421 | 0.0479 | 0.0518 |
| 8 | | | | | | | 0.0156 | 0.0262 | 0.0336 | 0.0388 |
| 9 | | | | | | | | 0.0123 | 0.0211 | 0.0275 |
| 10 | | | | | | | | | 0.0100 | 0.0174 |
| 11 | | | | | | | | | | 0.0083 |

4. Experiment and evaluation

4.1. Data set

For the experiment, we adopted a real database provided by LION TRAVEL from Taiwan. The dataset contains records of people who
travel abroad from 2003 to 2005. Data after 2006 are not presented due to privacy law in Taiwan. In the dataset, the number of males is 217,679 (about 38%) and the number of females is 346,438 (about 62%). The majority of people are between 30 and 60 years of age. The number of females in the database is thus about 1.6 times the number of males.

In order to overcome the sparsity problem, we chose the people who went abroad over three times. Only 36,638 records remain, of which 62.5% are female and 37.5% are male. We split the data into two parts: learning data (years 2003 and 2004) and recommending test data (year 2005) (see Fig. 9 and Table 6).

4.2. Evaluation metrics

To evaluate how accurately the proposed recommendation system assigns traveling locations with each of the three recommender methods, this research applies the accuracy measurement (Han & Kamber, 2001) to validate the system performance. A test set is used to measure the recommendation accuracy. Suppose the test set T comprises S records, where $T = \{t_s\}$. The associated class labels assigned by the classification module comprise a set $C_{t_s}$. Then the recommendation accuracy is computed as follows:

$$
a = \sum_{s=1}^{S} V_s / S, \quad \text{s.t.} \quad
V_s = \begin{cases} 1 & \text{if } C(t_s) \in C_{t_s},\ \forall t_s \in T \\ 0 & \text{otherwise} \end{cases}, \quad \forall s = 1, 2, \ldots, S
$$

where a is the recommendation accuracy, namely the percentage of the transactions in the test set that are correctly classified. $V_s$ ($s = 1, 2, \ldots, S$) is a binary variable that is set to one if transaction $t_s$ is correctly classified and to zero otherwise.

Data Envelopment Analysis (DEA), occasionally called frontier analysis, was first put forward by Charnes, Cooper, and Rhodes (1978). It is a performance measurement technique that can be used for evaluating the relative efficiency of decision-making units (DMUs) in organizations.
Here a DMU is a distinct unit within an organization that has flexibility with respect to some of the decisions it makes, but does not have complete freedom with respect to these decisions. Examples of such units to which DEA has been applied are banks, police stations, hospitals, tax offices, prisons, schools and university departments. One advantage of DEA is that it can be applied to non-profit making organizations. In this paper, we adopt a simple form of DEA, cost-benefit analysis, to analyze the experimental results.

Fig. 8. Our proposed structure of hybrid method.

Table 5
Statistical information of traveling location.

| Traveling location | Quantity | Proportion |
| Korea, Japan | 1000 | 1 |
| Korea, Japan, China | 400 (0.4) | 0.4 |
| Korea, Japan, Hong Kong and Macao | 300 (0.3) | 0.3 |
| Korea, Japan, Thailand | 200 (0.2) | 0.2 |
| Korea, Japan, Indonesia | 100 (0.1) | 0.1 |
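The selection rule of the hybrid method in Section 3.3 can be replayed on the figures of Table 5 with a minimal sketch. The function names, the dictionary layout, and the tie-breaking in favour of CF are assumptions made for illustration; only the counts and the compared sums come from the worked example in the text.

```python
def conditional_probabilities(counts, base):
    """Estimate P(location | travel history) as count / base for each location."""
    return {loc: n / base for loc, n in counts.items()}

def choose_method(cf_result, cbf_result, cond):
    """Pick the recommender whose list has the larger summed conditional probability.

    Ties go to CF here; that tie-breaking choice is an assumption.
    """
    p_cf = sum(cond.get(loc, 0.0) for loc in cf_result)
    p_cbf = sum(cond.get(loc, 0.0) for loc in cbf_result)
    return ("CF", p_cf) if p_cf >= p_cbf else ("CBF", p_cbf)

# Counts from Table 5: of 1000 customers who visited Korea and Japan,
# this many also visited each further location.
counts = {"China": 400, "Hong Kong and Macao": 300, "Thailand": 200, "Indonesia": 100}
cond = conditional_probabilities(counts, base=1000)

method, prob = choose_method(
    cf_result=["China", "Hong Kong and Macao"],
    cbf_result=["Thailand", "Indonesia"],
    cond=cond,
)
print(method, round(prob, 2))
```

Summing the conditional probabilities of a list favours whichever recommender returns more, or more popular, locations, so the comparison is only meaningful when both lists are built under a comparable policy.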
4.3. Experimental results

Our experiment compares seven cases, described as follows.

Case 1: CBF (Top 2). Recommending the top 2 destinations using content-based filtering.
Case 2: CBF (Top 3). Recommending the top 3 destinations using content-based filtering.
Case 3: CBF (Top 5). Recommending the top 5 destinations using content-based filtering.
Case 4: CF (Top 1). Recommending the top 1 neighbor using collaborative filtering.
Case 5: CF (Top 2). Recommending the top 2 neighbors using collaborative filtering.
Case 6: Hybrid. Combining CBF (Top 3) and CF (Top 2) filtering with Bayes' theorem.
Case 7: Top 5 Destinations (Baseline). Recommending the 5 most popular destinations based on simple statistics.

4.3.1. Content-based filtering (Case 1, Case 2 and Case 3)

Phase I: First, the data are divided into two parts: learning data and recommending test data. Based on the differences between traveling locations, we adopt a taxonomy to classify them into five continents (America, Oceania, Europe, Asia, and Africa). The next step is to analyze the relation between the traveling locations. The decision tree model of Clementine 7.2 is used to classify the learning data based on customer performance, and we adopt the C5.0 decision tree algorithm for this task. The basic input data consist of gender, age, constellation, and selling place, and the output data consist of traveling locations. Finally, a decision tree model is generated.

Phase II: We use the recommending test data as the input and the performance of customers as the output. The data are processed by the generated decision tree model, which produces an output for each record: the continent that best fits the user. We can combine this result with marketing policies to obtain the recommendation result. Top 2 means recommending the top 2 traveling locations in every continent.
The results are shown in Table 7. Top 3 means recommending the top 3 traveling locations in every continent. The results are shown in Table 8. Top 5 means recommending the top 5 traveling locations in every continent. The results are shown in Table 9. If the column "2005 location" is covered by the recommendation result, the hit column is set to one, and to zero otherwise.

4.3.2. Collaborative filtering (Case 4 and Case 5)

Phase I: Again, the data are first divided into two parts: learning data and recommending test data. We adopt the k-means algorithm to cluster the preprocessed data according to the attributes gender, age, constellation, selling place, and locus of going abroad. The basic input data consist of these five attributes, and the output data are the traveling locations. Finally, a k-means model is generated for use.

Phase II: Based on the clustering result, we calculate the similarity between a user and the other users in the same cluster. Cosine similarity (Margaret, 2003), a common measure in data mining, is adopted here to find the similarity between two loci. Top 1 means recommending the top 1 nearest neighbor in the same cluster, as shown in Table 10. Top 2 means recommending the top 2 nearest neighbors in the same cluster, as shown in Table 11.

4.3.3. Hybrid filtering (Case 6)

With hybrid filtering, every record is processed by both content-based filtering and collaborative filtering, and each filtering method returns a result. We then apply Bayes' theorem to decide which method has the higher posterior probability; the result of that method becomes the final result of the hybrid method. Table 12 shows the result of the hybrid method. If the CBF Bayesian probability is higher than the CF Bayesian probability, the choose column is set to CBF_Top3, and to CF_Top2 otherwise.
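The neighbor search of Section 4.3.2 and the selection rule of Section 4.3.3 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the user ids, feature vectors, cluster labels, and visited locations are hypothetical toy data, the cluster labels are assumed to come from an already fitted k-means model, and the posterior probabilities passed to `choose_method` are assumed to be given, since the text does not specify how they are computed. The `hit` rule (any overlap between recommended and actual locations) is likewise an assumed reading of "covered".

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an all-zero vector is similar to nothing
    return dot / (norm_a * norm_b)


def top_k_neighbors(target, vectors, clusters, k):
    """Ids of the k users most similar to `target` within its own cluster."""
    scored = [
        (cosine_similarity(vectors[target], vectors[uid]), uid)
        for uid in vectors
        if uid != target and clusters[uid] == clusters[target]
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [uid for _, uid in scored[:k]]


def recommend(target, vectors, clusters, visited, k):
    """Locations visited by the top-k neighbors, without duplicates."""
    locations = []
    for uid in top_k_neighbors(target, vectors, clusters, k):
        for loc in visited[uid]:
            if loc not in locations:
                locations.append(loc)
    return locations


def hit(recommended, actual):
    """1 if any actual location is covered by the recommendation, else 0."""
    return 1 if any(loc in recommended for loc in actual) else 0


def choose_method(p_cbf, p_cf):
    """Hybrid rule from the text: CBF_Top3 only if its probability is higher."""
    return "CBF_Top3" if p_cbf > p_cf else "CF_Top2"


# Hypothetical toy data (not taken from the paper's data set).
vectors = {"u1": [1, 0, 1], "u2": [1, 0, 1], "u3": [1, 1, 0], "u4": [0, 1, 0]}
clusters = {"u1": 0, "u2": 0, "u3": 0, "u4": 1}
visited = {"u2": ["Japan"], "u3": ["Korea", "Japan"], "u4": ["Thailand"]}

top1 = recommend("u1", vectors, clusters, visited, k=1)
top2 = recommend("u1", vectors, clusters, visited, k=2)
```

In this toy example, user `u4` is never considered as a neighbor of `u1` because it lies in a different cluster, which mirrors the restriction of the similarity search to users in the same cluster.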
Fig. 9. Bar chart of age.

Table 6. Statistical result of data set.

| Times | Record | Male | Female | 2003 | 2004 | 2005 |
|---|---|---|---|---|---|---|
| 3 times | 23,049 | 8375 | 14,674 | 5211 | 8548 | 9290 |
| 4 times | 7508 | 2896 | 4612 | 1708 | 2806 | 2994 |
| 5 times | 2880 | 1103 | 1777 | 685 | 1086 | 1109 |
| 6 times | 1392 | 587 | 805 | 289 | 531 | 572 |
| Over 7 times | 1809 | 789 | 1020 | 383 | 694 | 732 |
| Total | 36,638 | 13,750 | 22,888 | 8276 | 13,665 | 14,697 |

Table 7. Result of CBF Top 2 recommendation.

| Id | 0304 Location | 05 Location | Vector | Taxonomy | Decision tree | Top 2 | Hit |
|---|---|---|---|---|---|---|---|
| *0609 | Korea | Australia, Japan | 4 | 4 | 4 | Japan, China | 1 |
| *0228 | China | China, China | 13 | 4 | 4 | Japan, China | 1 |
| *1221 | Thailand | Malaysia and Singapore, Japan | 11 | 4 | 4 | Japan, China | 1 |
| *1025 | Indonesia | Australia, Indonesia | 7 | 4 | 4 | Japan, China | 0 |
| *0114 | Japan | Midwest Europe | 8 | 4 | 4 | Japan, China | 0 |
| *0412 | Canada | Japan | 17 | 1 | 1 | America, Canada | 0 |
| *0704 | Thailand | America, Japan | 11 | 4 | 4 | Japan, China | 1 |
| *0510 | Indonesia | Thailand, Thailand | 7 | 4 | 4 | Japan, China | 0 |
| *0512 | Thailand | Philippines, Cambodia and Vietnam, Indonesia | 11 | 4 | 4 | Japan, China | 0 |
| *1130 | Midwest Europe | Malaysia and Singapore, China | 6 | 3 | 3 | Midwest Europe, East Europe | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... |

Finally, based on the statistical data of 2003 and 2004, the 5 most popular places that travelers have visited are calculated as the result of Case 7 for comparison.

The outcome of the experiment is given in Table 13 and Figs. 10-12. Table 13 shows that Cases 1 to 6 perform better
than Case 7, which simply recommends the 5 most popular traveling locations: their error rates are lower and their correct rates are higher. All of the methods we propose are therefore more effective than Case 7, and the results are likely to be even better if we focus on specific target users. Among these methods, the hybrid method has the highest correct rate and the lowest error rate. This indicates that the hybrid method does integrate the advantages of content-based filtering and collaborative filtering, which is why it is more accurate than the other methods.

Based on the result in Fig. 12, we use the CBA method defined by Richard Layard to evaluate all the methods. To find the case with the highest correct rate at a tolerable error rate, suppose that a straight line passes through each case's point and the origin. For example, Case 6 and Case 7 each have such a line. The error rate of Case 7 is 53.5% and its correct rate is 46.5%. For Case 6, the error rate is 25.810% and the correct rate is 74.190%. Case 6 is clearly the better choice, as shown in Fig. 12. In terms of slope, the line of Case 6 is much steeper than that of Case 7, and a higher slope means a lower error rate. We therefore conclude that the case whose line has the highest slope is the best choice. Hence Case 6, the hybrid method, performs best in our experiment.

Table 8. Result of CBF Top 3 recommendation.
| Id | 0304 Location | 05 Location | Vector | Taxonomy | Decision tree | Top 3 | Hit |
|---|---|---|---|---|---|---|---|
| *0609 | Korea | Australia, Japan | 4 | 4 | 4 | Japan, China, Thailand | 1 |
| *0228 | China | China, China | 13 | 4 | 4 | Japan, China, Thailand | 1 |
| *1221 | Thailand | Malaysia and Singapore, Japan | 11 | 4 | 4 | Japan, China, Thailand | 1 |
| *1025 | Indonesia | Australia, Indonesia | 7 | 4 | 4 | Japan, China, Thailand | 0 |
| *0114 | Japan | Midwest Europe | 8 | 4 | 4 | Japan, China, Thailand | 0 |
| *0412 | Canada | Japan | 17 | 1 | 1 | America, Canada | 0 |
| *0704 | Thailand | America, Japan | 11 | 4 | 4 | Japan, China, Thailand | 1 |
| *0510 | Indonesia | Thailand, Thailand | 7 | 4 | 4 | Japan, China, Thailand | 1 |
| *0512 | Thailand | Philippines, Cambodia and Vietnam, Indonesia | 11 | 4 | 4 | Japan, China, Thailand | 0 |
| *1130 | Midwest Europe | Malaysia and Singapore, China | 6 | 3 | 3 | Midwest Europe, East Europe, North Europe | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... |

Table 9. Result of CBF Top 5 recommendation.

| Id | 0304 Location | 05 Location | Vector | Taxonomy | Decision tree | Top 5 | Hit |
|---|---|---|---|---|---|---|---|
| *0609 | Korea | Australia, Japan | 4 | 4 | 4 | Japan, China, Thailand, Indonesia, Korea | 1 |
| *0228 | China | China, China | 13 | 4 | 4 | Japan, China, Thailand, Indonesia, Korea | 1 |
| *1221 | Thailand | Malaysia and Singapore, Japan | 11 | 4 | 4 | Japan, China, Thailand, Indonesia, Korea | 1 |
| *1025 | Indonesia | Australia, Indonesia | 7 | 4 | 4 | Japan, China, Thailand, Indonesia, Korea | 1 |
| *0114 | Japan | Midwest Europe | 8 | 4 | 4 | Japan, China, Thailand, Indonesia, Korea | 0 |
| *0412 | Canada | Japan | 17 | 1 | 1 | America, Canada | 0 |
| *0704 | Thailand | America, Japan | 11 | 4 | 4 | Japan, China, Thailand, Indonesia, Korea | 1 |
| *0510 | Indonesia | Thailand, Thailand | 7 | 4 | 4 | Japan, China, Thailand, Indonesia, Korea | 1 |
| *0512 | Thailand | Philippines, Cambodia and Vietnam, Indonesia | 11 | 4 | 4 | Japan, China, Thailand, Indonesia, Korea | 0 |
| *1130 | Midwest Europe | Malaysia and Singapore, China | 6 | 3 | 3 | Midwest Europe, East Europe, North Europe | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... |

Table 10. Result of CF Top 1 recommendation.
| Id | 05 Location | kMean | Neighbor | Top 1 | Hit |
|---|---|---|---|---|---|
| *0609 | Midwest Europe | 1 | *0437 | Canada, Japan | 0 |
| *0228 | Japan | 9 | *1026 | Korea, China | 0 |
| *1221 | New Zealand and Australia, Malaysia and Singapore, Philippines, Cambodia and Vietnam, Japan | 1 | *8880 | Philippines, Cambodia and Vietnam, Korea, Malaysia and Singapore, Thailand | 1 |
| *1025 | Korea | 15 | *7079 | Hong Kong and Macao, Japan | 0 |
| *0114 | Midwest Europe | 4 | *1016 | Korea | 0 |
| *0412 | Japan, China | 4 | *8976 | Korea | 0 |
| *0704 | Thailand | 7 | *9226 | Japan | 0 |
| *0510 | Thailand, Thailand | 8 | *5903 | Thailand | 1 |
| ... | ... | ... | ... | ... | ... |

Table 11. Result of CF Top 2 recommendation.

| Id | 05 Location | kMean | Neighbor | Top 2 | Hit |
|---|---|---|---|---|---|
| *0609 | Midwest Europe | 1 | *0437, *1716 | Canada, Japan | 0 |
| *0228 | Japan | 9 | *1026, *2737 | Korea, China, Japan | 1 |
| *1221 | New Zealand and Australia, Malaysia and Singapore, Philippines, Cambodia and Vietnam, Japan | 1 | *8880, *2025 | Philippines, Cambodia and Vietnam, Korea, Malaysia and Singapore, Thailand | 1 |
| *1025 | Korea | 15 | *7079, *2025 | Hong Kong and Macao, Japan, Thailand, China | 0 |
| *0114 | Midwest Europe | 4 | *1016, *686 | Korea | 0 |
| *0412 | Japan, China | 4 | *8976, *3078 | Korea, Japan | 1 |
| *0704 | Thailand | 7 | *9226, *9280 | Korea, Japan | 0 |
| *0510 | Thailand, Thailand | 8 | *5903, *1435 | Thailand | 1 |
| ... | ... | ... | ... | ... | ... |
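The slope comparison used in the evaluation of Case 6 and Case 7 can be made concrete with a short Python sketch. It assumes that each case is plotted with the error rate on the horizontal axis and the correct rate on the vertical axis, so that the slope of the line through the origin is the correct rate divided by the error rate; the axis assignment is an assumption inferred from the statement that a higher slope means a lower error rate. The function name `slope` and the dictionary keys are illustrative only; the rates are the ones reported in the text.

```python
def slope(correct_rate, error_rate):
    """Slope of the line through the origin and the point (error, correct)."""
    if error_rate == 0:
        return float("inf")  # assumption: a perfect case gets an unbounded slope
    return correct_rate / error_rate


# (correct rate %, error rate %) as reported for Case 6 and Case 7.
cases = {
    "Case 6 (hybrid)": (74.190, 25.810),
    "Case 7 (baseline)": (46.5, 53.5),
}

slopes = {name: slope(correct, error) for name, (correct, error) in cases.items()}
best = max(slopes, key=slopes.get)

for name, value in slopes.items():
    print(f"{name}: slope = {value:.3f}")
print("Best case:", best)
```

Under this reading, a slope above 1 means that a case is correct more often than it is wrong, which holds for Case 6 but not for Case 7.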