Parallel User Profiling Based on Folksonomy for Large Scaled Recommender Systems: An Implementation of Cascading MapReduce

Huizhi Liang, Faculty of Science and Technology, Queensland University of Technology, Brisbane, Queensland, Australia, oklianghuizi@gmail.com
Jim Hogan, Faculty of Science and Technology, Queensland University of Technology, Brisbane, Queensland, Australia, j.hogan@qut.edu.au
Yue Xu, Faculty of Science and Technology, Queensland University of Technology, Brisbane, Queensland, Australia, yue.xu@qut.edu.au

Abstract—The large scaled emerging user created information in web 2.0, such as tags, reviews, comments and blogs, can be used to profile users' interests and preferences to make personalized recommendations. To solve the scalability problem of current user profiling and recommender systems, this paper proposes a parallel user profiling approach and a scalable recommender system. The current advanced cloud computing techniques including Hadoop, MapReduce and Cascading are employed to implement the proposed approaches. The experiments were conducted on Amazon EC2 Elastic MapReduce and S3 with a real world large scaled dataset from the Del.icio.us website.

Keywords—User Profiling; Large Scale Recommender Systems; Cloud Computing; Tags; Folksonomy; Web 2.0

I. INTRODUCTION

Information overload has become a very serious issue. Personalization can provide effective solutions to deal with information overload. As the basis of personalization, the accuracy and efficiency of web user profiling greatly affect the performance of recommender systems and other personalization systems. To improve the accuracy and effectiveness of web user profiles, it is important to explore novel approaches that profile users based on the emerging user created textual information in web 2.0 [2]. The emerging user created information such as tags, reviews, comments and blogs implies users' important interests and preferences. Thus, these kinds of information can be used to profile users [2]. Based on the generated user profiles, recommender systems can make personalized recommendations to users to deal with the information overload issue. The scalability problem is one of the major issues for recommender systems [1]. With the rapid growth of online communities, there are a large number of users and items in these communities. This brings challenges to profiling users and making recommendations efficiently. Therefore, how to profile a large number of users efficiently and how to design scalable recommender systems are very important questions.

Parallel computing is an effective way to solve the problem of the high computation complexity of large scaled datasets. Cloud computing provides software and hardware platforms to facilitate parallel computing. With the development of cloud computing software and tools such as Hadoop^1, MapReduce^2 and Cascading^3, as well as the availability of public cloud computing services such as Amazon EC2^4, it becomes easy and convenient for general users with little or no cloud computing knowledge to implement and run custom applications with large scaled datasets on clouds quickly.

In this paper, we propose a parallel user profiling approach based on folksonomy information using these advanced cloud computing techniques. In addition, a paralleled scalable recommender system implemented based on Cascading MapReduce is also proposed. This paper is organized as follows. Firstly, the related work is briefly reviewed in Section II.
Then, some important definitions are given in Section III. The proposed approaches are discussed in Section IV, where the parallel user profiling and recommendation making approaches designed based on Cascading MapReduce are presented. In Section V, the design of the experiments, the experimental results and discussions are presented. The conclusions and future work are discussed in Section VI.

II. RELATED WORK

A. Recommender systems

Recommender systems have been an active research area for more than a decade, and many different techniques and systems with distinct strengths have been developed. The collaborative filtering recommendation approach is popularly used. Usually, recommendation making can be divided into three steps: user profiling, neighborhood forming and recommendation generation [1]. The user profiling step profiles each user's interests or preferences. The second step is to find similar users for each user based on the generated user profiles. Then, the nearest or most similar users' items will be treated as candidate items. The most popular items among the neighbor users will be recommended to the target users.

Recently, how to make item recommendations based on folksonomy or tag information has become an important research focus [2]. Our previous work [7] proposed an approach to profile users' preferences based on tags and make personalized item recommendations.

^1 http://hadoop.apache.org/
^2 http://hadoop.apache.org/mapreduce/
^3 http://www.cascading.org
^4 http://aws.amazon.com/ec2/

2010 IEEE International Conference on Data Mining Workshops, 978-0-7695-4257-7/10 $26.00 © 2010 IEEE, DOI 10.1109/ICDMW.2010.161
But how to make the user profiling and recommendation approaches more efficient and scalable still remains an open research question.

The scalability problem is an important issue for recommender systems [1]. The large numbers of users, items and other information in real life online communities bring challenges to recommender systems. How to design and implement scalable recommender systems is very important. Mahout^5 is a scalable open source recommender system, which is implemented based on Hadoop and MapReduce. However, it only uses users' explicit ratings to make recommendations and fails to profile users based on implicit ratings or other useful emerging information such as tags in web 2.0. More recently, some paralleled data mining and knowledge discovery approaches based on cloud computing techniques were proposed, such as paralleled similarity calculation [4], clustering [9] and random walk [8]. The work in [5] explored how to extract user profiles from large scale data with MapReduce. However, it did not use tag information to profile users.

B. Cloud computing

Apache Hadoop is a Java software framework that supports data-intensive distributed applications under a free license. It enables applications to work with thousands of nodes and petabytes of data. Hadoop was inspired by Google's MapReduce and Google File System (GFS) papers [3]. Hadoop is a top-level Apache project, being built and used by a community of contributors from all over the world such as Amazon, Google, Yahoo and others.

MapReduce is a framework for processing huge datasets on certain kinds of distributable problems using a large number of computers (nodes), collectively referred to as a cluster. Computational processing can occur on data stored either in a filesystem (unstructured) or within a database (structured). The MapReduce framework consists of two parts: Map and Reduce. For the Map part, the master node takes the input, chops it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. Each worker node processes its smaller problem and passes the answer back to its master node. For the Reduce part, the master node takes the answers to all the sub-problems and combines them to get the output, the answer to the problem it was originally trying to solve.

Cascading is a feature rich API for defining and executing complex, scale-free, and fault tolerant data processing workflows on a Hadoop cluster. The processing API lets the developer quickly assemble complex distributed processes without having to "think" in MapReduce, and schedules them efficiently based on their dependencies and other available meta-data. The Cascading processing model is based on a "pipes and filters" metaphor. The developer uses the Cascading API to assemble pipelines that split, merge, group, or join streams of data while applying operations to each data record or group of records. Because of its simplicity and user friendliness, Cascading has been officially supported by Amazon Elastic MapReduce.

The key concepts of Cascading include Flow, Pipe, Tap, Cascade, and others [6]. A pipeline is called a pipe assembly. Before a pipe assembly can be executed, it must be bound to data sources and data sinks, called Taps. The process of binding pipe assemblies to sources and sinks results in a Flow. Flows can be executed on a data cluster like Hadoop.

^5 http://lucene.apache.org/mahout/
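To make the Map and Reduce phases described above concrete for the data used in this paper, the following single-machine sketch emulates in plain Java how a MapReduce job could count the number of distinct users per tag from "user::item::tag" tagging records. It is only an illustrative single-process analogue of the distributed computation; the record format and class name are our own assumptions, not code from the paper.

import java.util.*;

// Minimal single-process analogue of a MapReduce job that counts the
// number of distinct users per tag from "user::item::tag" records.
public class TagUserCountSketch {
    public static void main(String[] args) {
        List<String> records = Arrays.asList(
                "u1::p1::java", "u2::p1::java", "u1::p2::java", "u3::p2::cloud");

        // Map phase: each record is mapped to a (tag, user) pair.
        List<String[]> mapped = new ArrayList<>();
        for (String record : records) {
            String[] fields = record.split("::");          // [user, item, tag]
            mapped.add(new String[]{fields[2], fields[0]});
        }

        // Shuffle phase: group the intermediate pairs by tag.
        Map<String, Set<String>> grouped = new HashMap<>();
        for (String[] pair : mapped) {
            grouped.computeIfAbsent(pair[0], k -> new HashSet<>()).add(pair[1]);
        }

        // Reduce phase: for each tag, emit the number of distinct users.
        for (Map.Entry<String, Set<String>> entry : grouped.entrySet()) {
            System.out.println(entry.getKey() + "\t" + entry.getValue().size());
        }
    }
}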
The collection of Flows is called a Cascade. There are five Pipe types: Each, GroupBy, CoGroup, Every, and SubAssembly [6]. The definitions and more detailed information are in [6]. A Cascading flow will be converted into MapReduce jobs that can be executed on a Hadoop cluster by an inner MapReduce Job Planner. Figure 1 shows how a reasonably normal Flow would be partitioned into MapReduce jobs [6].

Figure 1. A Cascading flow and MapReduce jobs (this graph is from the user guide book of Cascading [6])

III. DEFINITIONS

To describe the proposed approach, we define some key concepts and entities used in this paper as below.

- Users: $U = \{u_1, u_2, \dots, u_{|U|}\}$ contains all users in an online community who have used tags to label and organize items.
- Items (i.e., Products, Resources): $P = \{p_1, p_2, \dots, p_{|P|}\}$ contains all items tagged by users in $U$. Items could be any type of resources or products in an online community such as web pages, videos, music tracks, photos, academic papers, documents and books, etc.
- Tags: $T = \{t_1, t_2, \dots, t_{|T|}\}$ contains all tags used by users in $U$. A tag is a piece of textual information given by one or more users to label or collect items.
- User Profile: $\mathcal{T}^u(u_i) = \langle w_{i,1}, \dots, w_{i,|T|} \rangle$ represents user $u_i$'s preferences to the tags $\langle t_1, \dots, t_{|T|} \rangle$, where $w_{i,x}$ is the preference weight of user $u_i$ to tag $t_x$.

In this paper, we focus on the top N item recommendation task. Let $u_i \in U$ be a target user, $P_{u_i}$ be the item set that the user $u_i$ already has, $p_k \in P - P_{u_i}$ be a candidate item, and $\mathcal{A}(u_i, p_k)$ be the prediction score of how much user $u_i$ would be interested in the item $p_k$. The problem of item recommendation is defined as generating a set of ordered items $p_1, \dots, p_N \in P - P_{u_i}$ for the user $u_i$, where $\mathcal{A}(u_i, p_1) \geq \dots \geq \mathcal{A}(u_i, p_N)$.
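As a rough illustration of the entities defined above, the following sketch (our own, not part of the paper's implementation) represents a tagging record as a (user, item, tag) triple and a user profile as a sparse map from tags to preference weights, one practical way to store the |T|-sized tag vector when most weights are zero.

import java.util.*;

// Illustrative data model for the definitions in Section III.
public class FolksonomyModel {
    // A single tagging record: user u_i applied tag t_x to item p_k.
    public static class TaggingRecord {
        public final String user, item, tag;
        public TaggingRecord(String user, String item, String tag) {
            this.user = user; this.item = item; this.tag = tag;
        }
    }

    // A user profile T^u(u_i): sparse representation of <w_{i,1}, ..., w_{i,|T|}>.
    public static class UserProfile {
        public final String user;
        public final Map<String, Double> tagWeights = new HashMap<>();
        public UserProfile(String user) { this.user = user; }
    }
}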
IV. THE PROPOSED APPROACHES

A. User profiling and recommendation model

A user profile is used to describe a user's interests and preferences. Typically, an explicit or implicit rating vector is used in collaborative filtering based recommender systems to profile a user's preferences or interests in the items. With folksonomy information, we can use a set of tags with their corresponding weights to profile users' topic interests [7]. The work in [7] discussed how to calculate a user's preference for a tag. In this paper, we discuss how to parallelize the user profiling process proposed in [7].

Let $u_i$ be a user and $t_x$ be a tag. Each user $u_i$ can be profiled by a $|T|$-sized tag vector $\mathcal{T}^u(u_i) = \langle w_{i,1}, \dots, w_{i,|T|} \rangle$ for the tags $\langle t_1, \dots, t_{|T|} \rangle$, where $w_{i,x}$ is the preference weight of $u_i$ to $t_x$. $w_{i,x}$ can be calculated by the formula below:

$w_{i,x} = \frac{1}{|P_{u_i}|} \cdot \sum_{p_k \in P_{u_i}} \mathcal{P}r(t_x \mid p_k) \cdot iuf(t_x)$   (1)

where $|P_{u_i}|$ is the number of items that have been collected by $u_i$ and $\mathcal{P}r(t_x \mid p_k)$ is the probability of tag $t_x$ being used to tag item $p_k$, given item $p_k$. This formula is a simplified version of Formula (7) in [7]. $\mathcal{P}r(t_x \mid p_k) = |U_{p_k,t_x}| / |U_{p_k}|$, where $|U_{p_k,t_x}|$ is the number of users that have used tag $t_x$ to tag item $p_k$ and $|U_{p_k}|$ is the number of users that have tagged item $p_k$. $iuf(t_x)$ is the inverse user frequency of tag $t_x$ over the whole user set, $iuf(t_x) = 1/\log(e + |U_{t_x}|)$, where $e$ is an irrational constant approximately equal to 2.72 and $0 < iuf(t_x) \leq 1$.

The algorithm of user profiling based on folksonomy is shown as below:

Table 1. User Profiling Algorithm

Algorithm 1: User Profiling
Input: users' tagging records $[u_i, t_x, p_k]$
Output: user profile $\mathcal{T}^u(u_i) = \langle w_{i,1}, \dots, w_{i,|T|} \rangle$
1: Begin
2: For each item $p_k \in P$
3:   For each tag $t_x \in T_{p_k}$, the tags of item $p_k$
4:     Get $\mathcal{P}r(t_x \mid p_k) = |U_{p_k,t_x}| / |U_{p_k}|$
5: For each user $u_i \in U$
6:   Get $|P_{u_i}|$, the total number of tagged items of $u_i$
7:   For each tag $t_x \in T$
8:     Get $\sum_{p_k \in P_{u_i}} \mathcal{P}r(t_x \mid p_k)$
9:     Get $|U_{t_x}|$
10:    Get $iuf(t_x) = 1/\log(e + |U_{t_x}|)$
11:    Get $w_{i,x} = \frac{1}{|P_{u_i}|} \cdot \sum_{p_k \in P_{u_i}} \mathcal{P}r(t_x \mid p_k) \cdot iuf(t_x)$
12: End

The work [7] proposed two hybrid user and item based recommendation approaches. In this paper, we discuss the paralleled user based collaborative filtering recommendation approach. The formula that measures the cosine similarity of two users is shown below:

$sim(u_i, u_j) = \frac{\sum_{z=1}^{|T|} w_{i,z} \cdot w_{j,z}}{\sqrt{(\sum_{z=1}^{|T|} w_{i,z}^2) \cdot (\sum_{z=1}^{|T|} w_{j,z}^2)}}$   (2)

The K nearest neighbor users who have similar user profiles to $u_i$ are denoted as $\mathcal{N}(u_i) = \{u_j \mid u_j \in maxK\{sim(u_i, u_j)\}, u_j \in U\}$, where the function $maxK()$ returns the top K most similar users to $u_i$.

A set of items that are most frequently tagged by the neighbors of the target user or most similar to the target user's items will be recommended to the target user. For each target user $u_i$, a set of candidate items will be generated from the items tagged by $u_i$'s neighbourhood formed based on the similarity of user profiles. The prediction score for each candidate item $p_k$, denoted as $\mathcal{A}(u_i, p_k)$, can be calculated as:

$\mathcal{A}(u_i, p_k) = \sum_{u_j \in \mathcal{N}(u_i) \cap U_{p_k}} sim(u_i, u_j)$   (3)

where $U_{p_k}$ is the set of users that tagged item $p_k$ and $\mathcal{N}(u_i) \cap U_{p_k}$ is the set of selected neighbor users in $\mathcal{N}(u_i)$ who have tagged item $p_k$.

The recommendation algorithm is shown as below:

Table 2. Recommendation Algorithm

Algorithm 2: Recommendation
Input: users' tagging records $[u_i, t_x, p_k]$
Output: recommended item list $p_1, \dots, p_N \in P - P_{u_i}$ for each user $u_i$, where $\mathcal{A}(u_i, p_1) \geq \dots \geq \mathcal{A}(u_i, p_N)$
1: Begin
2: For each user $u_i \in U$
3:   Get the user profile based on Equation 1
4:   Get the neighbor users based on Equation 2
5:   For each candidate item $p_k$
6:     Get the prediction score based on Equation 3
7:   Recommend the top N items to user $u_i$
8: End
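As a point of reference for the parallel implementation described next, the following is a minimal non-parallel sketch of Algorithm 1 in plain Java: it collects the counts needed for Pr(t_x|p_k), |P_{u_i}| and |U_{t_x}| from in-memory tagging records and combines them according to Equation (1). The record format, class name and use of the natural logarithm are our assumptions; Section V mentions such a single-machine Java implementation only for the running time comparison, and this sketch is not the authors' code.

import java.util.*;

// Non-parallel sketch of Algorithm 1: tag-based user profiling (Equation 1).
public class UserProfilingSketch {

    // Each record is assumed to be "user::item::tag".
    public static Map<String, Map<String, Double>> profile(List<String> records) {
        Map<String, Set<String>> usersPerItem = new HashMap<>();      // U_{p_k}
        Map<String, Set<String>> usersPerItemTag = new HashMap<>();   // U_{p_k,t_x}
        Map<String, Set<String>> itemsPerUser = new HashMap<>();      // P_{u_i}
        Map<String, Set<String>> usersPerTag = new HashMap<>();       // U_{t_x}
        Map<String, Set<String>> tagsPerItem = new HashMap<>();       // T_{p_k}

        for (String r : records) {
            String[] f = r.split("::");                               // [user, item, tag]
            String user = f[0], item = f[1], tag = f[2];
            usersPerItem.computeIfAbsent(item, k -> new HashSet<>()).add(user);
            usersPerItemTag.computeIfAbsent(item + "::" + tag, k -> new HashSet<>()).add(user);
            itemsPerUser.computeIfAbsent(user, k -> new HashSet<>()).add(item);
            usersPerTag.computeIfAbsent(tag, k -> new HashSet<>()).add(user);
            tagsPerItem.computeIfAbsent(item, k -> new HashSet<>()).add(tag);
        }

        Map<String, Map<String, Double>> profiles = new HashMap<>();
        for (Map.Entry<String, Set<String>> e : itemsPerUser.entrySet()) {
            String user = e.getKey();
            Map<String, Double> weights = new HashMap<>();
            for (String item : e.getValue()) {
                for (String tag : tagsPerItem.get(item)) {
                    // Pr(t_x | p_k) = |U_{p_k,t_x}| / |U_{p_k}|
                    double pr = (double) usersPerItemTag.get(item + "::" + tag).size()
                            / usersPerItem.get(item).size();
                    weights.merge(tag, pr, Double::sum);              // sum over p_k in P_{u_i}
                }
            }
            int numItems = e.getValue().size();                       // |P_{u_i}|
            for (Map.Entry<String, Double> w : weights.entrySet()) {
                // iuf(t_x) = 1 / log(e + |U_{t_x}|), natural logarithm assumed
                double iuf = 1.0 / Math.log(Math.E + usersPerTag.get(w.getKey()).size());
                w.setValue(w.getValue() / numItems * iuf);            // Equation (1)
            }
            profiles.put(user, weights);
        }
        return profiles;
    }
}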
B. Large Scale Implementation

To solve the problem of the high computation complexity of user profiling and recommendation making, we discuss how to parallelize Algorithm 1 and Algorithm 2 in this section. Since Cascading can largely simplify the complexities of Hadoop application development, job creation and job scheduling, we implemented the proposed application with Cascading.

The recommendation process illustrated in Algorithm 2 can be divided into three steps: user profiling, neighborhood forming and recommendation making. Thus, we firstly discuss how to parallelize each step individually. Then, we discuss how to parallelize the three steps together. The three steps can be implemented by three Flows in Cascading.
Step 1: User Profiling

Figure 2 illustrates the proposed paralleled user profiling process.

Figure 2. Paralleled Data Flow of User Profiling

The calculation of each user's preference weight to each tag shown in Algorithm 1 can be implemented by Flow fUserProfiling. The two Taps tSource and tSink1 specify the input and output of Flow fUserProfiling. The flow includes six Pipes: pLine, pItemTag, pUserItem, pUserTag, pTagIuf and pUserProfile. Some Pipes can be processed in parallel while some need the output of another Pipe as input data. The program fragment of the implementation of this flow is shown as below.

Table 3. Program fragment of the Flow fUserProfiling

Program fragment 1: the implementation of Flow fUserProfiling

// The setting of flow fUserProfiling
String inputPath = args[0];
String outputPath1 = args[1];
Tap tSource = new Hfs(new TextLine(), inputPath);
Tap tSink1 = new Hfs(new TextLine(), outputPath1 + "/", true);
// The implementation of Pipe pLine
Pipe pLine = new Each("pline", new Fields("line"),
    new RegexSplitter(new Fields("user1", "item1", "tag1"), "::"));
// The implementation of Pipe pUserItem
/* Count is an inner aggregator of Cascading that calculates the number of items in the current group */
Pipe pUserItem = new GroupBy("pUserItem", pLine, new Fields("user1", "item1"), new Fields("user1"));
pUserItem = new Every(pUserItem, new Fields("user1", "item1"), new Count(new Fields("count1")));
// The implementations of other pipes are omitted
...
Map source_1 = Cascades.tapsMap(Pipe.pipes(pLine), Tap.taps(tSource));
Flow flow_userprofile = flowConnector.connect("userProfiling algorithm", source_1, tSink1, pUserProfile);

The Pipe pLine splits each line of each input file. The Pipe pItemTag calculates the probability of each tag for each item. It can be paralleled with the Pipe pUserItem that counts the number of tagged items of each user. The outputs of the two Pipes are used as the input data of the Pipe pUserTag to calculate the average probability of each tag being used to tag the items of the user. The Pipe pUserTag can be paralleled with the Pipe pTagIuf that calculates the inverse user frequency of each tag. After calculating the product of the output results of the Pipes pUserTag and pTagIuf with the Pipe pUserProfile, the weights that indicate the degree of each user's preference to a tag can be obtained. The output results are stored as files in the output path. The output files can be used as input by the next step to calculate the similarity of users.

Step 2: Neighborhood forming

The key job of this step is to calculate the similarity of each user pair. To facilitate the implementation of a complete recommender system in Cascading, we discuss how to parallelize the cosine similarity calculation in Cascading. Figure 3 illustrates the proposed paralleled cosine similarity calculation approach. Table 4 shows a program fragment of the implementation of this flow.
Table 4. Program fragment of the Flow fNeighbourhoodForming

Program fragment 2: the implementation of Flow fNeighbourhoodForming

...
// The implementation of Pipe pUserI
/* power() is a self defined subclass that extends the Aggregator class */
Pipe pUserI = new Each("pUserI", new Fields("profileLine"),
    new RegexSplitter(new Fields("u1", "t1", "w1")));
// The implementation of Pipe pPowerI
Pipe pPowerI = new GroupBy("pPowerI", pUserI, new Fields("u1"), new Fields("u1"));
pPowerI = new Every(pPowerI, new Fields("w1"), new power(new Fields("power1")), new Fields("u1", "power1"));
...
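The self defined power aggregator used above is not listed in the paper. As a hedged sketch, the per-user quantity it has to produce for Equation (2) is simply the accumulated sum of squared tag weights, which in plain Java amounts to the following (the class and method names are ours):

import java.util.List;

// Illustrative equivalent of the self defined "power" aggregation used in
// Flow fNeighbourhoodForming: for one user's grouped weights w_{i,z}, it
// accumulates sum(w_{i,z}^2), i.e. the denominator terms of Equation (2).
public class PowerAggregationSketch {
    public static double sumOfSquares(List<Double> weights) {
        double power = 0.0;
        for (double w : weights) {
            power += w * w;
        }
        return power;
    }
}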
Mainly, it parallelizes the calculation of the accumulated sum of the squares of the elements of each vector and that of the products of the non-zero elements of two vectors. The Flow fNeighborhoodForming also includes six Pipes: pUserI, pUserJ, pPowerI, pPowerJ, pMultiplyIJ, and pSimilarityIJ. The input of this Flow is the output user profiles of Flow fUserProfiling. The two Pipes pUserI and pUserJ are used to split the input lines. The Pipes pPowerI and pPowerJ calculate the accumulated sum of the squares of the elements of the two tag vectors. They are paralleled with the Pipe pMultiplyIJ that calculates the accumulated sum of the products of the non-zero elements of the two tag vectors. The Pipe pSimilarityIJ firstly calculates the square root of the product of the outputs of pPowerI and pPowerJ. Then, the output of pMultiplyIJ is divided by this square root, and the similarity of the two users is obtained. The output files are stored in another output path. The output of this Flow also can be used to generate recommendations directly.

Figure 3. Paralleled Data Flow of Neighborhood Forming

Step 3: Recommendation making

Figure 4 illustrates the proposed paralleled recommendation making approach. The Flow fRecommending includes two Pipes: pCandidateItem and pRecommender. The input of pRecommender is the output of Flow fNeighborhoodForming. After filtering out those items that are already tagged by the target user, the items tagged by the top K neighbours are selected as candidate items. The Pipe pRecommender calculates the prediction score of each candidate item for each user. After ranking the candidate items by the prediction score, the final output files are stored in an output path.

Figure 4. Paralleled Data Flow of Recommendation making

Step 4: Parallelizing the above three steps

The above three steps have an obvious dependency relationship. Since Cascading checks the actual dependencies of Flow instances at run time, the three steps can be paralleled: if two or more Flow instances have no dependencies, they will be submitted together so they can execute in parallel. The program fragment of the parallelization of the three steps with Cascading is shown in Table 5.
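Before turning to Table 5, a non-parallel sketch of what Steps 2 and 3 compute may help in reading the two flows. The helper below is an illustrative assumption rather than the paper's code: it combines the two power sums and the dot product into the cosine similarity of Equation (2), and accumulates prediction scores according to Equation (3).

import java.util.*;

// Illustrative single-machine equivalents of the fNeighborhoodForming and
// fRecommending flows (Equations 2 and 3).
public class NeighborhoodAndScoringSketch {

    // Equation (2): cosine similarity from the dot product (pMultiplyIJ) and
    // the two accumulated power sums (pPowerI, pPowerJ).
    public static double cosine(double dotProduct, double powerI, double powerJ) {
        return dotProduct / Math.sqrt(powerI * powerJ);
    }

    // Equation (3): the prediction score of candidate item p_k for user u_i is
    // the sum of similarities of the selected neighbors who tagged p_k.
    public static double predictionScore(Map<String, Double> neighborSimilarities,
                                         Set<String> usersWhoTaggedItem) {
        double score = 0.0;
        for (Map.Entry<String, Double> n : neighborSimilarities.entrySet()) {
            if (usersWhoTaggedItem.contains(n.getKey())) {
                score += n.getValue();
            }
        }
        return score;
    }
}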
Table 5. Program fragment of the parallelization of the three steps

Program fragment 3: parallelizing the three steps

/* parallelRecsys.class is the name of the self defined class file */
Properties properties = new Properties();
properties.setProperty("hadoop.job.ugi", "hadoop,hadoop");
FlowConnector.setApplicationJarClass(properties, parallelRecsys.class);
FlowConnector flowConnector = new FlowConnector(properties);
CascadeConnector connector = new CascadeConnector();
Cascade cascade = connector.connect(fUserProfiling, fNeighborhoodForming, fRecommending);
cascade.complete();

V. EXPERIMENTS

A. Datasets

We conducted the experiments with the Del.icio.us dataset [10]. The dataset contains all public bookmarks of about 950,000 users retrieved from http://delicious.com between September 2003 and April 2008. The retrieval process resulted in about 132 million bookmarks or 420 million tag assignments. The full corpus is about 7GB of compressed data. It is one of the largest folksonomy datasets used in research so far. The details of the dataset are discussed in [11].

B. Experiments setup

1) Experimental environment

The experimental environment includes the development environment and the computation environment. We used a local desktop with internet access as the development machine. Since the operating system is Windows XP, we installed the latest version of Cygwin^6 as the shell to run Linux/Unix commands. Java is used as the programming language and JDK 1.6.0 was installed. Eclipse 3.4.1^7 was installed as the programming and building tool. Hadoop 1.18.3 and Cascading 1.0.18 were installed. The stand-alone mode of Hadoop was used to debug the programs.

The computation platform is the Amazon EC2 Elastic MapReduce cloud^8 (i.e., Amazon EMR). The clients can select the types or sizes of the clouds, and the payment is according to the actual running time. Amazon EMR supports Cascading and Hadoop MapReduce and can be run in the AWS console mode, which makes it easy and convenient to submit and run custom applications/jobs on Amazon EC2. Amazon S3 is used to store the input and output data. The software CloudBerry Explorer for Amazon S3^9 was installed on the local machine to transfer data between the local machine and Amazon S3.

2) Experimental environment setup

The setup and configuration of the experimental environment is very important. The setup of the development and computation environment is as below:

Step 1. The setup of the development environment. The first step is to set up a development environment to develop the custom application and make sure that the grammar and business logic are correct. As mentioned above, JDK 1.6.0, Eclipse 3.4.1, Hadoop 1.18.3 and Cascading 1.0.18 were installed. After we set up JAVA_HOME for Hadoop, we can run the WordCount example of Hadoop in the stand-alone mode to test whether the development environment has been configured successfully.

Step 2. The implementation of the custom application. When the development environment has been set up successfully, we can implement the custom application and build the executable jar file with Eclipse.

Step 3. Run and debug the custom executable jar file in Cygwin or Linux/Unix with Hadoop.

Step 4. Run the custom executable jar file in Amazon EC2 clouds.

^6 http://www.cygwin.com/
^7 http://www.eclipse.org/
^8 http://aws.amazon.com/elasticmapreduce/
^9 http://cloudberrylab.com/?page=cloudberry-explorer-amazon-s3
After the implemented custom application runs successfully in the stand-alone mode of Hadoop, we can submit the application to real clouds such as Amazon EC2. Before we can submit a job, we need to create an Amazon EC2 account first. Then, we need to sign up for the Amazon Elastic MapReduce and Amazon S3 services. S3 is used to store the input and output data while EMR is used for computation. Though we can use Hadoop commands to download and upload data, data transfer software is more convenient for transferring data between the local machine and S3. After installing CloudBerry Explorer for Amazon S3 on the local machine, we can connect to Amazon S3 with the secure credential information. We need to create a bucket in S3. We can create folders in the bucket we created and upload the custom jar and input data to the bucket. After uploading the jar and input data to S3, we can run the jar in EC2. Since the AWS console supports job submission and the management of jobs with a web browser, it becomes easy to submit and run custom jobs in EC2. In Elastic MapReduce of the AWS Management Console^10, we can create a new work flow and select custom jar as the job type, then specify the Jar Location and Arguments. The path of the Jar Location is <bucketname>/path/to/jar, while the input data path and output data path are s3n://<bucketname>/path/to/input/ and s3n://<bucketname>/path/to/output. It will cause an "input path of file doesn't exist" error if the path setting is not correct. After selecting the size, CPU and memory type of the clouds, we can run the job in Amazon EMR. With the console, we can easily debug and manage the submitted jobs. The output data will be stored in the created bucket in S3. We can download the output data to the local machine with CloudBerry Explorer for Amazon S3.

C. Results and discussions

To evaluate the effectiveness of the proposed paralleled user profiling and recommendation approaches, we created three running Jobs on Amazon EC2 EMR clouds and conducted two comparison experiments on both clouds and a local desktop machine.

^10 http://aws.amazon.com/console/
Job 1 was to conduct user profiling based on the proposed paralleled user profiling approach for the whole Del.icio.us dataset. The cloud configuration was 10 nodes with high performance CPU, a middle scaled cloud. It took 11 hours and 18 minutes to finish the job.

We used one month of Del.icio.us data to conduct efficiency comparison experiments. Job 2 was to conduct paralleled user profiling with the one month data (May, 2004) on a small scale cloud of Amazon EC2 EMR. It took 24 minutes to process the one month data of the Delicious dataset with a 4 node small scale cloud. To evaluate the efficiency of the paralleled user profiling approach implemented in Cascading MapReduce, we also implemented the user profiling Algorithm 1 in Java in a non-parallel way. It was run on a local Dell desktop machine with 1G memory and a 3G CPU. It took 1 hour and 10 minutes to process the same data. The running times of Job 2 are shown in Figure 5.

Figure 5. The running time comparison of Job 2

Job 3 was to conduct the proposed parallel recommendation making process with the one month data on the same scale cloud as Job 2. It took 2 hours and 14 minutes to process the one month data with a 4 node small scale cloud. Similar to Job 2, we also implemented the recommendation approach Algorithm 2 in Java in a non-parallel way and ran it on the same local desktop machine. It took 7 hours to process the same data. The running time comparison of Job 3 is shown in Figure 6.

Figure 6. The running time comparison of Job 3

Discussions: Although it is difficult to fairly compare the running times of clouds and a desktop machine, the results in Figure 5 and Figure 6 show that the paralleled user profiling and recommendation making approaches can effectively improve the efficiency of large scaled data processing. This suggests that Cascading MapReduce and cloud computing services can provide a low cost, user friendly and effective solution to help users to solve the scalability problem of user profiling and recommender systems.

VI. CONCLUSIONS

In this paper, we discussed how to design and implement a parallel user profiling approach based on folksonomy information and how to apply it to a scalable recommender system. The popular Hadoop, MapReduce and Cascading techniques are used to implement the paralleled user profiling and recommendation system. The experiments were conducted on Amazon EC2 Elastic MapReduce and S3 with the real world large scale dataset from the Del.icio.us website. The experimental results show that the proposed paralleled user profiling and recommendation making approaches can improve the efficiency.

This paper also can be used as an example or case study to help readers create and run their own applications with cloud computing techniques quickly. Another important contribution of this paper is to show that cloud computing is ready for general users that have less cloud computing background knowledge. With easy-to-understand programming APIs such as Cascading, as well as user friendly third party cloud computing services such as Amazon Elastic MapReduce, users can rapidly develop complex large scale data processing applications.

ACKNOWLEDGMENT

This work was conducted when the first author was an intern of the cloud computing internship and competition project of the Australian Academy of Technological Sciences and Engineering (ATSE), January-March, 2010. The authors would like to thank the support of the High Performance Computation (HPC) group of Queensland University of Technology, and the sponsorship of ATSE and the Australian Research Collaboration Service (ARCS).

REFERENCES

[1] Adomavicius, G., and Tuzhilin, A., Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions, IEEE Transactions on Knowledge and Data Engineering, 17(6):2005, pp. 734-749.
[2] Li, X., Guo, L., and Zhao, Y. E., Tag-based social interest discovery. In Proc. of WWW'08, 2008, 675-684.
[3] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google File System. The 19th ACM Symposium on Operating Systems Principles, 2003.
[4] Elsayed, T., Lin, J., and Oard, D., Pairwise Document Similarity in Large Collections with MapReduce, In Proc. of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies, pages 265-268, 2008.
[5] Shmueli-Scheuer, M., Roitman, H., Carmel, D., Mass, Y., Konopnicki, D., Extracting User Profiles from Large Scale Data, In Proc. of the 2010 Workshop on Massive Data Analytics on the Cloud.
[6] Cascading User Guide V1.0, http://www.cascading.org/userguide/pdf/userguide.pdf
[7] Liang, H., Xu, Y., Li, Y., and Nayak, R., Connecting users and items with weighted tags for personalized item recommendations. In Proc. of HT'10, 51-60.
[8] Chiang, M., Wang, T., Peng, W., Parallelizing Random Walk with Restart for Large-Scale Query Recommendation, In Proc. of the 2010 Workshop on Massive Data Analytics on the Cloud, 2010.
[9] Böse, J., Andrzejak, A., Högqvist, M., Beyond Online Aggregation: Parallel and Incremental Data Mining with Online Map-Reduce, In Proc. of the 2010 Workshop on Massive Data Analytics on the Cloud.
[10] http://robertwetzker.com/2009/06/25/delicious-dataset-for-download/
[11] Wetzker, R., Zimmermann, C., and Bauckhage, C., Analyzing Social Bookmarking Systems: A del.icio.us Cookbook, In Proc. of the European Conference on Artificial Intelligence (ECAI), 2008.