Flex Recs: Expressing and combining Flexible recommendations Georgia Koutrika Bercovitz. Hector Garcia-Molina Computer Science Depa Stanford University, Stanford, California, USA outr O, hector/@cs. stanford. edu ABSTRACT Recommendation systems have become very popular but most rec- bedded in the system code, not expressed declaratively. From ommendation methods are 'hard-wired into the system making ex the designer viewpoint, this fact makes it hard to modify the perimentation with and implementation of new recommendation algorithm, or to experiment with different approaches radigms cumbersome. In this paper, we propose FlexRecs,a No Flexibility: The recommendations provided are fixed. End framework that decouples the definition of a recommendation pro. users are given few choices. For example, a user may be unable cess from its execution and supports flexible recommendations over to request recommendations for movies that could be jointly structured data. In FlexRecs, a recommendation approach can be seen by her and her friend, or that her recommendations be defined declaratively as a high-level parameterized workflow com- based on what people in her age group are watching. Users rising traditional relational operators and new operators that gen may expect diverse recommendations in different contexts erate or combine recommendations. We describe a prototype flex Limited world model: Recommendation approaches deal with ible recommendation engine that realizes the proposed framework two types of entities, users and items(e. g, movies), represented and we present example workflows and experimental results that as sets of ratings or features. Providing recommendations using show its potential for capturing multiple, existing or novel, recom- richer data representations is not straightforward. For example mendations easily and having a flexible recommendation system a user may want recommendations for courses from users with that combines extensibility with reasonable performance. similar grades and similar ratings Categories and Subject Descriptors <. In this paper, we propose FlexRecs, a framework that allows flex- nendations to be easily defined, customiz H.3.3 (Information Storage and Retrieval Information Search cessed over structured data. FlexRecs(a)decouples the definition and Retrieval-Search process of a recommendation process from execution, (b)declaratively de- fines a recommendation process as a high-level workflow and (c) General terms enables generating any recommendations with the same engine A given recommendation approach can be expressed as a high- Algorithms, Languages, Performa level workflow, which may contain traditional relational operators such as select, project and join, plus new recommendation opera- Keywords tors that generate or combine recommendations. A workflow handle data(including recommendations) in relational form. flexible recommendations, recommendation operators, recommen- a designer can easily create multiple, customizable workflows dation queries for content-based, collaborative, as well as novel recommendation paradigms. The end user can select from them, depending on her 1. INTRODUCTION information needs. This selection is done through a gul, which Recommendation systems provide advice on movies also allows the user to enter parameters for workflows in order travel, leisure activities, and many other topics, and I urate and personalized recommendations. For in- ery popular in systems, such as Google News [10]. Al stance, the user may specify that her recommendations be based and MovieLens 19]. Since the appearance of the first on what people in her age group are watching. This choice gets dation systems [12, 22, 261, many recommendation approaches ranslated into a select condition, which is passed to the appropri- have been proposed both by the industry and academia. However ate workflow. This functionality is essentially similar to advanced most recommendation systems have a number of limitations searches: a designer builds a set of parameterized SQL queries End users can execute these queries choosing parameter values to receive different results through an advanced search interface ble recommendation en realizes the proposed framework. The system allows executie is granted without fee provided that copies are workflow over a conventional DBMs by"compiling" it into a se not made or di rofit or commercial advantage and that copies quence of SQL calls. The recommendation operators may call upon bear this notice and the full citation on the first page. To copy otherwise, to functions in a library that implement common tasks for generating publish, to post on servers or to redistribute to lists, requires prior specifi recommendations, such as computing the Jaccard or Pearson sim of two sets of objects, dictions. When possible, library functions are compiled into the
FlexRecs: Expressing and Combining Flexible Recommendations Georgia Koutrika, Benjamin Bercovitz, Hector Garcia-Molina Computer Science Department, Stanford University, Stanford, California, USA {koutrika, berco, hector}@cs.stanford.edu ABSTRACT Recommendation systems have become very popular but most recommendation methods are ‘hard-wired’ into the system making experimentation with and implementation of new recommendation paradigms cumbersome. In this paper, we propose FlexRecs, a framework that decouples the definition of a recommendation process from its execution and supports flexible recommendations over structured data. In FlexRecs, a recommendation approach can be defined declaratively as a high-level parameterized workflow comprising traditional relational operators and new operators that generate or combine recommendations. We describe a prototype flexible recommendation engine that realizes the proposed framework and we present example workflows and experimental results that show its potential for capturing multiple, existing or novel, recommendations easily and having a flexible recommendation system that combines extensibility with reasonable performance. Categories and Subject Descriptors H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Search Process General Terms Algorithms, Languages, Performance Keywords flexible recommendations, recommendation operators, recommendation queries 1. INTRODUCTION Recommendation systems provide advice on movies, products, travel, leisure activities, and many other topics, and have become very popular in systems, such as Google News [10], Amazon [17] and MovieLens [19]. Since the appearance of the first recommendation systems [12, 22, 26], many recommendation approaches have been proposed both by the industry and academia. However, most recommendation systems have a number of limitations: Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGMOD’09, June 29–July 2, 2009, Providence, Rhode Island, USA. Copyright 2009 ACM 978-1-60558-551-2/09/06 ...$5.00. • Hard Wired: The recommendation algorithm is typically embedded in the system code, not expressed declaratively. From the designer viewpoint, this fact makes it hard to modify the algorithm, or to experiment with different approaches. • No Flexibility: The recommendations provided are fixed. End users are given few choices. For example, a user may be unable to request recommendations for movies that could be jointly seen by her and her friend, or that her recommendations be based on what people in her age group are watching. Users may expect diverse recommendations in different contexts. • Limited world model: Recommendation approaches deal with two types of entities, users and items (e.g., movies), represented as sets of ratings or features. Providing recommendations using richer data representations is not straightforward. For example, a user may want recommendations for courses from users with similar grades and similar ratings. In this paper, we propose FlexRecs, a framework that allows flexible recommendations to be easily defined, customized, and processed over structured data. FlexRecs (a) decouples the definition of a recommendation process from execution, (b) declaratively de- fines a recommendation process as a high-level workflow and (c) enables generating any recommendations with the same engine. A given recommendation approach can be expressed as a highlevel workflow, which may contain traditional relational operators such as select, project and join, plus new recommendation operators that generate or combine recommendations. A workflow can handle data (including recommendations) in relational form. A designer can easily create multiple, customizable workflows for content-based, collaborative, as well as novel recommendation paradigms. The end user can select from them, depending on her information needs. This selection is done through a GUI, which also allows the user to enter parameters for workflows in order to get more accurate and personalized recommendations. For instance, the user may specify that her recommendations be based on what people in her age group are watching. This choice gets translated into a select condition, which is passed to the appropriate workflow. This functionality is essentially similar to advanced searches: a designer builds a set of parameterized SQL queries. End users can execute these queries choosing parameter values to receive different results through an advanced search interface. We describe a prototype flexible recommendation engine that realizes the proposed framework. The system allows executing a workflow over a conventional DBMS by “compiling” it into a sequence of SQL calls. The recommendation operators may call upon functions in a library that implement common tasks for generating recommendations, such as computing the Jaccard or Pearson similarity of two sets of objects, or perform more fancy statistical predictions. When possible, library functions are compiled into the
Year, Term, InstrlD, Location, T ime Slot, Days what are yau coking far? Figure 1: Extract from the course database nts: in other cases we can rely on external functions led by the sQl statements. T 1.1 Motivation Our work on FlexRecs was motivated by, and has been imple- mented as part of CourseRank, a social tool we have developed 2: Fixed recommendations in Course Rank in our lab. CourseRank helps students in our university to make informed choices about classes and take advantage of the avail- proach X be more effective than approach Y in our environment? able learning options. It displays official university information What are the right weights for blending two recommendations(e.g and statistics, such as bulletin course descriptions, grade distri one based on what students like. and another based on what courses tions, and results of official course evaluations. Students can are more critical for completing a major)? What is the best way to anonymously rank courses they have taken, add comments, and predict, not good courses, but likely grades a student will obtain in rank the accuracy of each others'comments. They can also get future courses? Implementing many different recommendation ap- recommendations, organize their classes into a quarterly schedule proaches and their variants from scratch, possibly as different sy or devise a four year plan and track their progress. CourseRank also tem modules, can be time-consuming and counter-productive and functions as a feedback tool for faculty and administrators, ensur- does not lend to high code reusability. It also leads to a recommen ing that information is as accurate as possible. To support this func- dation system that is not easily expandable and manageable and it nakes experimenting with different recommendation possibilities the courses offered, the instructors, the students, the textbooks cumbersome and slow and so forth. Figure 1 provides a small snapshot of the database hema. The use inside the university, out of a total of about 14, 000 students. The describe recommendations. These descriptions would also need to vast majority of users are undergraduates, and there are only about handle parameters, so that end-users could at run time personal 7,000 undergraduates in our university. ize their recommendations, e.g., by choosing a selection condition, recommendations The need for flexibility and expressivity. Although the ini- ing Flex Recs framework has indeed made it possible to(a) easily tial version of Course Rank has been very popular [2], we received capture multiple recommendation paradigms and allow users dy nany requests, from students and administrators, for more flexible namically personalize them to fit their needs, (b)experiment with recommendations: As most commercial recommendation systems, novel recommendation strategies, (c)design a flexible and extensi our initial version offered no choices, just a list of recommended ble system and increase developer productivity: instead of adding courses for the student to consider. as shown in Figure 2. The new modules, a developer needs to define a new recommendation ure displays a general list of collaborative recommendations fo workflow or expand the library of reusable functions if a new com- particular student containing a Robotics course and a Spanish lan arison or prediction model is needed. The new version of Cours- guage course. If the student is interested in Spanish courses, she cRank employs the flexible recommendation engine to support di lay prefer to see more Spanish courses. If she is interested in ferent features and to let end-users tailor their recommendations French courses, she may not want to see any of our recommenda- We have also designed a user interface for flexible recommenda- tions. Students want to specify the type of course they are inter- tions in CourseRank [16 ested in(e.g, a biology class, something that satisfies the univer sity's writing requirement). They also want to request recommen- 1.2 Contributions dations based on a peer group they select (e. g, students in their major, or freshmen only)and based on different criteria For exam In summary, the contributions of our work are as follows: ple, a student may want recommendations for CS courses from C We propose decoupling the definition of a recommendation pro students with similar grades(i.e, with similar performance)and for cess from its execution and present FlexRecs, a framework for dance classes from students with similar ratings (i.e. with similar fining flexible recommendations over structured data. In Flex astes). In some cases, students want to see the recommendations Recs, a recommendation approach can be defined declaratively the system would give their best friend, not themselves s a high-level parameterized workflow comprising traditional The need for experimentation and higher productivity. As rec- relational operators ommendations comprise an integral part of the system, we want to We introduce an extend operator that generates a virtual experiment with different recommendation approaches: Would relation. For example, a tuple in an extended student rel ar stories are known from other domains [1, 5]. For y contain an attribute that gives all the courses taken by Amazon customer that once bought a book for his student. Although there is nothing new in the concept of a favor of milar to other kidsbooks buyers and is bound to see makes the whole framework work: extended relations simplify recommendations in his/her future transactions with the system. the definition of our other operator
Departments(DepID, DepCode, Name) Courses(CourseID, DepID, Title, Description, Units, Url) CourseSched(CourseID, Year, Term, InstrID, Location, TimeSlot, Days) Instructors(InstrID, Name, Url) Students(SuID, Name, Class, GPA, Status) StudentStudies(SuID, StudyPrgID) StudyPrograms(StudyPrgID, ProgramName, Classification, DepID) StudentHistory(SuID, CourseID, Year, Term, Grade, Rating) Comments(SuID, CourseID, Year, Term, Text, Rating, Date) Figure 1: Extract from the course database SQL statements; in other cases we can rely on external functions that are called by the SQL statements. 1.1 Motivation Our work on FlexRecs was motivated by, and has been implemented as part of CourseRank, a social tool we have developed in our lab. CourseRank helps students in our university to make informed choices about classes and take advantage of the available learning options. It displays official university information and statistics, such as bulletin course descriptions, grade distributions, and results of official course evaluations. Students can anonymously rank courses they have taken, add comments, and rank the accuracy of each others’ comments. They can also get recommendations, organize their classes into a quarterly schedule or devise a four year plan and track their progress. CourseRank also functions as a feedback tool for faculty and administrators, ensuring that information is as accurate as possible. To support this functionality, we maintain a database that stores rich information about the courses offered, the instructors, the students, the textbooks, and so forth. Figure 1 provides a small snapshot of the database schema. The system is already used by more than 9,000 students inside the university, out of a total of about 14,000 students. The vast majority of users are undergraduates, and there are only about 7,000 undergraduates in our university. • The need for flexibility and expressivity. Although the initial version of CourseRank has been very popular [2], we received many requests, from students and administrators, for more flexible recommendations: As most commercial recommendation systems, our initial version offered no choices, just a list of recommended courses for the student to consider, as shown in Figure 2. The figure displays a general list of collaborative recommendations for a particular student containing a Robotics course and a Spanish language course. If the student is interested in Spanish courses, she may prefer to see more Spanish courses. If she is interested in French courses, she may not want to see any of our recommendations1 . Students want to specify the type of course they are interested in (e.g., a biology class, something that satisfies the university’s writing requirement). They also want to request recommendations based on a peer group they select (e.g., students in their major, or freshmen only) and based on different criteria. For example, a student may want recommendations for CS courses from CS students with similar grades (i.e., with similar performance) and for dance classes from students with similar ratings (i.e., with similar tastes). In some cases, students want to see the recommendations the system would give their best friend, not themselves. • The need for experimentation and higher productivity. As recommendations comprise an integral part of the system, we want to experiment with different recommendation approaches: Would ap- 1 Similar stories are known from other domains [1, 5]. For instance, an Amazon customer that once bought a book for his 9-year-old will be considered a kids’ books fan or be mistakenly considered similar to other kids’ books buyers and is bound to see irrelevant recommendations in his/her future transactions with the system. Figure 2: Fixed recommendations in CourseRank proach X be more effective than approach Y in our environment? What are the right weights for blending two recommendations (e.g., one based on what students like, and another based on what courses are more critical for completing a major)? What is the best way to predict, not good courses, but likely grades a student will obtain in future courses? Implementing many different recommendation approaches and their variants from scratch, possibly as different system modules, can be time-consuming and counter-productive and does not lend to high code reusability. It also leads to a recommendation system that is not easily expandable and manageable and it makes experimenting with different recommendation possibilities cumbersome and slow. Faced with the above challenges, we need a way to declaratively describe recommendations. These descriptions would also need to handle parameters, so that end-users could at run time personalize their recommendations, e.g., by choosing a selection condition, or by weighting recommendations that are blended. The resulting FlexRecs framework has indeed made it possible to (a) easily capture multiple recommendation paradigms and allow users dynamically personalize them to fit their needs, (b) experiment with novel recommendation strategies, (c) design a flexible and extensible system and increase developer productivity: instead of adding new modules, a developer needs to define a new recommendation workflow or expand the library of reusable functions if a new comparison or prediction model is needed. The new version of CourseRank employs the flexible recommendation engine to support different features and to let end-users tailor their recommendations. We have also designed a user interface for flexible recommendations in CourseRank [16]. 1.2 Contributions In summary, the contributions of our work are as follows: • We propose decoupling the definition of a recommendation process from its execution and present FlexRecs, a framework for defining flexible recommendations over structured data. In FlexRecs, a recommendation approach can be defined declaratively as a high-level parameterized workflow comprising traditional relational operators and new recommendation operators. • We introduce an extend operator that generates a virtual nested relation. For example, a tuple in an extended student relation may contain an attribute that gives all the courses taken by that student. Although there is nothing new in the concept of a nested relation, this particular flavor of nested relation is what makes the whole framework work: extended relations simplify the definition of our other operators
We define recommend and blend operators that capture the essence ommend top N items to a user. Users and items are represented of most recommendation workflows. The operators can be com- by rudimentary profiles(e.g, as sets of ratings or keywords) and posed and combined with more traditional select, project, and recommendations are based on profile comparisons. For example The main challenge in designing these opera the content-based component of the Fab system [6] recommends ing the appropriate functionality and generality. web pages to users and represents web page content with the 100 care, one can easily come up with operators that most important words. Similarly, the Syskill Webert system are not useful for common recommendations represents documents with the 128 most informative words [211 or are unnecessarily complex While recommendations are based on a limited understanding of ovide several examples that illustrate how common rec- the world, many real applications use much richer data residing ndation strategies( found in literature or implemented in in databases. Different types of entities may co-exist in a single Rank) can be expressed as parameterized workflows with database(e.g, books, authors, publishers, and so forth) with sev- ur operators, as well as new recommendation types. eral attributes. Current approaches are not very expressive, e.g they cannot provide recommendations for different entities taking into account different subsets of their attributes realizes the proposed framework and executes multiple work The inherent limitations of recommendation systems have been flows transparently. Our strategy reduces the amount of data saved in temporary relations during execution. For instance, knowledged [5, 15] and some extensions have been recently pro- osed, such as incorporating multi-criteria ratings into recommen extended relations are never materialized. Furthermore, our dations [3]. The language RQL has been the first proposal that al- new operators can be compiled into standard SQL for execu tion. Hence recommendation workflows are converted into se- lows users to formulate recommendations in a flexible manner [41 However, RQL queries are not very expressive because they quences of standard SQL calls, which are then executed by a formulated on a pre-specified multidimensional cube of ratings conventional database system taking advantage of the underly ng query optimization. The recommendation operators may In contrast to existing works on recommendations, in this paper, call upon functions in a library. When possible, library func- we do not propose or extend a particular recommendation method. tions are also compiled into the sQL statements. We propose a general and open framework for expressing and pro- cessing fexible recommendations over relational data that covers We present experimental results that show the potential of our existing as well as allows capturing novel recommendation paradign approach for capturing multiple, existing or novel, recommen- As part of our effort, we have also implemented a flexible recom- dations easily and having a flexible recommendation system mendation interface in our system that allows students to request that combines extensibility with reasonable performance. For nd customize their recommendations [16] our evaluation, we use real data from our production system. 2. RELATED WORK 3. RECOMMENDATION FRAMEWORK Since the appearance of the first papers on collaborative filter- 3.1 Data model [12, 22, 261, a lot of work has been done from improving and evaluating recommendation methods [3, 13, 27] to designing trust- Large amounts of data reside in structured form, and particularl in relational form, such as e-store, university, corporate, and scien- thy systems [18, 20). In their core, these systems provide (a) tific data. Our own university database is relational too. Therefore, content-based recommendations for items similar to those the use we focus on databases that follow the relational model. which is preferred in the past [7, 21]or(b)collaborative recommendations enriched with some additional features for items that people with similar tastes liked [8, 22] or(c)h brid recommendations by combining recommendation methods [6, a database comprises a set of relations. A relation R(denoted 24]. Although a popular and highly researched area, recommen- Ri when more than one relation is implied) has a set A of at- tributes. We will use R. A, to refer to an attribute in A or simply dation systems present important limitations for users as well as A. when R is understood. A tuple r in R is a vector of single, researchers and developers [ 5] The recommendation algorithm as well as its inputs are hard refer to the value of A, in tuple r. An attribute that can be instan- menting with new methods can be time-consuming and counter tiated to a single value v is called base attribute. A relation with only base attributes is called base relation productive [15]. Furthermore, hard-wired algorithms generate only a predefined and fixed set of recommendations, which cannot ca In addition, we introduce the concept of an extended relation. ture the real-time user information needs. For example, in content- DEFINITIOn 1. An extended relation S has base attributes and based methods, the user is limited to see items that are similar to extended attributes. An extended attribute Ei(Al,. Ae)is instan those she liked in the past. On the other hand, collaborative filter- tiated to a relation with attributes A,.A ing methods provide serendipitous recommendations but they have problems, such as the inability to recommend new items [6]. There Given a tuple s in the relation S, s[Eu is a with attribute ave been many approaches that try to solve some of these prob Al,..Ae. If t is a tuple in this nested relation, then t[A, gives lems, such as filtering out items if they are too similar to those seen the value of attribute A, in this relation. We will use Ei. A, to refer before [7 or using genetic algorithms [27]. Several recommen to an attribute that is part of an extended attribute E, or simply Aj dation systems use a hybrid approach to avoid certain limitations when Ei is understood, and we call A, an embedded attribute of content-based and collaborative systems [6, 24]. However, of Consequently, under our semantics, the notion of an attribute ten different recommendations may be required under different cir- value has a broader meaning capturing scalar values as well as re- cumstances by different users. Current approaches do not support lations. We consider that only base relations can be part of a tuple fexible customizable recommendations allowing one level of nesting While our model and language could Furthermore, they deal with applications that have two types of be generalized to arbitrary nesting, we have not seen a need for entities, users and items(e.g, movies, web pages) and they rec- this generality in any of the practical recommendation scenarios we
• We define recommend and blend operators that capture the essence of most recommendation workflows. The operators can be composed and combined with more traditional select, project, and join operators. The main challenge in designing these operators is in capturing the appropriate functionality and generality. Without a lot of care, one can easily come up with operators that do not compose, are not useful for common recommendations or are unnecessarily complex. • We provide several examples that illustrate how common recommendation strategies (found in literature or implemented in CourseRank) can be expressed as parameterized workflows with our operators, as well as new recommendation types. • We describe a prototype flexible recommendation engine that realizes the proposed framework and executes multiple work- flows transparently. Our strategy reduces the amount of data saved in temporary relations during execution. For instance, extended relations are never materialized. Furthermore, our new operators can be compiled into standard SQL for execution. Hence, recommendation workflows are converted into sequences of standard SQL calls, which are then executed by a conventional database system taking advantage of the underlying query optimization. The recommendation operators may call upon functions in a library. When possible, library functions are also compiled into the SQL statements. • We present experimental results that show the potential of our approach for capturing multiple, existing or novel, recommendations easily and having a flexible recommendation system that combines extensibility with reasonable performance. For our evaluation, we use real data from our production system. 2. RELATED WORK Since the appearance of the first papers on collaborative filtering [12, 22, 26], a lot of work has been done from improving and evaluating recommendation methods [3, 13, 27] to designing trustworthy systems [18, 20]. In their core, these systems provide (a) content-based recommendations for items similar to those the user preferred in the past [7, 21] or (b) collaborative recommendations for items that people with similar tastes liked [8, 22] or (c) hybrid recommendations by combining recommendation methods [6, 24]. Although a popular and highly researched area, recommendation systems present important limitations for users as well as researchers and developers [5]. The recommendation algorithm as well as its inputs are hard wired in the system code. Designing, implementing and experimenting with new methods can be time-consuming and counterproductive [15]. Furthermore, hard-wired algorithms generate only a predefined and fixed set of recommendations, which cannot capture the real-time user information needs. For example, in contentbased methods, the user is limited to see items that are similar to those she liked in the past. On the other hand, collaborative filtering methods provide serendipitous recommendations but they have problems, such as the inability to recommend new items [6]. There have been many approaches that try to solve some of these problems, such as filtering out items if they are too similar to those seen before [7] or using genetic algorithms [27]. Several recommendation systems use a hybrid approach to avoid certain limitations of content-based and collaborative systems [6, 24]. However, often different recommendations may be required under different circumstances by different users. Current approaches do not support flexible, customizable, recommendations. Furthermore, they deal with applications that have two types of entities, users and items (e.g., movies, web pages) and they recommend top N items to a user. Users and items are represented by rudimentary profiles (e.g., as sets of ratings or keywords) and recommendations are based on profile comparisons. For example, the content-based component of the Fab system [6] recommends web pages to users and represents web page content with the 100 most important words. Similarly, the Syskill & Webert system represents documents with the 128 most informative words [21]. While recommendations are based on a limited understanding of the world, many real applications use much richer data residing in databases. Different types of entities may co-exist in a single database (e.g., books, authors, publishers, and so forth) with several attributes. Current approaches are not very expressive, e.g., they cannot provide recommendations for different entities taking into account different subsets of their attributes. The inherent limitations of recommendation systems have been acknowledged [5, 15] and some extensions have been recently proposed, such as incorporating multi-criteria ratings into recommendations [3]. The language RQL has been the first proposal that allows users to formulate recommendations in a flexible manner [4]. However, RQL queries are not very expressive because they are formulated on a pre-specified multidimensional cube of ratings. In contrast to existing works on recommendations, in this paper, we do not propose or extend a particular recommendation method. We propose a general and open framework for expressing and processing flexible recommendations over relational data that covers existing as well as allows capturing novel recommendation paradigms. As part of our effort, we have also implemented a flexible recommendation interface in our system that allows students to request and customize their recommendations [16]. 3. RECOMMENDATION FRAMEWORK 3.1 Data Model Large amounts of data reside in structured form, and particularly in relational form, such as e-store, university, corporate, and scientific data. Our own university database is relational too. Therefore, we focus on databases that follow the relational model, which is enriched with some additional features. A database comprises a set of relations. A relation R (denoted Ri when more than one relation is implied) has a set A of attributes. We will use R.Aj to refer to an attribute in A or simply Aj when R is understood. A tuple r in R is a vector of single, numerical or categorical, values. We will use the notation r[Aj ] to refer to the value of Aj in tuple r. An attribute that can be instantiated to a single value v is called base attribute. A relation with only base attributes is called base relation. In addition, we introduce the concept of an extended relation. DEFINITION 1. An extended relation S has base attributes and extended attributes. An extended attribute Ei(A1, ... Ae) is instantiated to a relation with attributes A1, ... Ae. Given a tuple s in the relation S, s[Ei] is a relation with attributes A1, ... Ae. If t is a tuple in this nested relation, then t[Aj ] gives the value of attribute Aj in this relation. We will use Ei.Aj to refer to an attribute that is part of an extended attribute Ei or simply Aj when Ei is understood, and we call Aj an embedded attribute. Consequently, under our semantics, the notion of an attribute value has a broader meaning capturing scalar values as well as relations. We consider that only base relations can be part of a tuple allowing one level of nesting. While our model and language could be generalized to arbitrary nesting, we have not seen a need for this generality in any of the practical recommendation scenarios we
Students a is a list of base. embedded or extended attributes from r If A contains only base attributes, then the resulting relation is always a base one. Join, Mc, creates a new relation by combining tuples in Ri Ext_Students SuD Name Comments(CourselD), Rating. I and R, that meet some condition c, which refers only to base Paul Lite C1 I 2Feb2008 attributes of the two relations. i.e R:∞xB1:={(r,r)r∈RAr;∈B;An,r; satisfy c 15Mar2007 12Dec2007 A common type of join is natural join, which connects two c56622Jn2007 relations by equating attributes of the same name, and project- c7722Jn20 ing out one copy of each pair of equated attributes. This is The base opera handle extended relations Figure 3: A base and an extended relation and be combined with the new recommendation operators to be de- fined next but we have kept their original semantics. One could go have studied and implemented and we do not want to unnecessarily further and define operators with extended semantics. For example burden our design one could define that the select operator operates also on extended Example: We consider the course database with the base rela- attributes. There is a rich literature on nested relational algebras tions shown in Figure 1. figure 3 shows an instance of the Students that discuss extended, recursive operations such as nesting, unn base relation. We can define the extended relation Ext_ Students that g and projections and joins on embedded relations [9, 11, 25 contains all the base attributes of a student plus an extended at- We believe that such generality is not necessary for practical rec- tribute that represents the ratings made by each student as a singl ommendations. Our framework can be easily extended to handle "unit of information"per student more generalized relational operators Ext_Students(SulD, Name, Comments( CourselD, Rating, Date)) 3.3 The Extend Operator Figure 3 shows an instance for this relation. To refer to the ex with many(normalized) database schemas, information that con- tended attribute we use Ext Students. Comments, and to the rating at ceptually refers to a single entity(e. g, a student's course ratings) tribute within Ext._ Students, we use Comments. Rating or simply Rating. is often found in several relations. In our s. ded)relation.For we want to repre- In a similar way, we can define other extended relations, such sent such application entities with a single(e the extended relation Ext Courses that extends the base course a tuple in an extended relation could contain base infor- relation with an attribute that shows all the instructors per year and n a student(e. g, name, GPA), plus the set of courses a term that a course was taught as taken. For this purpose, we introduce a new operator Ext_ Courses(CourselD, DepID, Title, Instructors( Name, Year, Term)) es such extended attributes in the tuples of a relation In Section 3.3, we will show how these example extended rela- DEFINITION 2. Extend, E, creates an extended relation by em tions can be defined using the extend operator bedding a base relation Ri into another relation Ri, i.e. Extended relations can be thought of as"views e;:={(r;,v)lr∈R:At:=A(n凶R;)} group together information related to an individual resent it as a single tuple that can be easily handled by the recom- where A is the set of attributes of R mendation operators irrespective of the database structure (as we In words, Ri E Ri is an extended relation that contains all the will see in Sections 3.4-3.6). attributes and tuples from Ri plus an extended attribute whose(ex- Although the issue of whether extended relations are material- tended)value for each tuple ri from Ri is a set of tuples from R ized or not is orthogonal to their definition, in practice, extended that join to ri on the common attributes. If there are no joining relations may not be stored in the database. In the following sub- tuples, then the extended attribute will be an empty relation for ri ections, we describe the operators required to handle extended re- The default name of this extended attribute is the name of R, lations and formulate recommendations. We start with the standard In the above definition, only R is required to be a bas (core)relational algebra operators with set semantics and we extend because we allow one-level nesting. Ri can be a base or them enabling interaction with extended relations as well relation, since any extended relation can have more than ended attribute. The extend operator can be defined for 3.2 Base Operators level nesting as well. We consider the following operators that can operate on base and Example: Using the extend operator we can generate the ex- Select, ac, selects tuples from relation R, for which the con- PComments: =(SulD, CourselD, Rating. Date)(Comments) dition c holds. ie Ext_Students : T (SulD, Name)(Students) PComments c(B):={r|r∈R∧ r satisfies c} R can be a base or extended relation and c is a condition that combine information found in more than two relations as follows. refers to base attributes of R. o(R)will be a base or extended relation depending on the type of R. Schedules T(CourselD, ear, Term, strid)( Course Sched) reates a new relation by projecting R into a Ext_ Courses =T(CourselD, DeplD, Tte( Courses) Instruct_ Sched aller set of its attributes. i.e. Note that joining over attributes with common names simplifies 丌A(B):={r{A]r∈R the presentation and the definition of the extend operator. It does
Students SuID Name Class Status 1 2 Paul Little John Doe 2009 2010 H N Ext_Students SuID Name Comments (CourseID, Rating, Date) 1 2 Paul Little John Doe C1 C2 5 6 2 Feb 2008 3 Dec 2007 C1 C2 5 6 15 Mar 2007 12 Dec 2007 C5 6.6 22 Jun 2007 C7 7 22 Jun 2007 Figure 3: A base and an extended relation have studied and implemented and we do not want to unnecessarily burden our design. Example: We consider the course database with the base relations shown in Figure 1. Figure 3 shows an instance of the Students base relation. We can define the extended relation Ext_Students that contains all the base attributes of a student plus an extended attribute that represents the ratings made by each student as a single “unit of information” per student: Ext_Students(SuID, Name, Comments(CourseID, Rating, Date)) Figure 3 shows an instance for this relation. To refer to the extended attribute, we use Ext_Students.Comments, and to the rating attribute within Ext_Students, we use Comments.Rating or simply Rating. In a similar way, we can define other extended relations, such as the extended relation Ext_Courses that extends the base course relation with an attribute that shows all the instructors per year and term that a course was taught: Ext_Courses(CourseID, DepID, Title, Instructors(Name, Year, Term)) In Section 3.3, we will show how these example extended relations can be defined using the extend operator. Extended relations can be thought of as “views” that collect and group together information related to an individual entity and represent it as a single tuple that can be easily handled by the recommendation operators irrespective of the database structure (as we will see in Sections 3.4 − 3.6). Although the issue of whether extended relations are materialized or not is orthogonal to their definition, in practice, extended relations may not be stored in the database. In the following subsections, we describe the operators required to handle extended relations and formulate recommendations. We start with the standard (core) relational algebra operators with set semantics and we extend them enabling interaction with extended relations as well. 3.2 Base Operators We consider the following operators that can operate on base and extended relations. • Select, σc, selects tuples from relation R, for which the condition c holds, i.e., σc(R) := {r|r ∈ R ∧ r satisfies c} R can be a base or extended relation and c is a condition that refers to base attributes of R. σc(R) will be a base or extended relation depending on the type of R. • Project, πA, creates a new relation by projecting R into a smaller set of its attributes, i.e., πA(R) := {r[A]|r ∈ R } A is a list of base, embedded or extended attributes from R. If A contains only base attributes, then the resulting relation is always a base one. • Join, ./c, creates a new relation by combining tuples in Ri and Rj that meet some condition c, which refers only to base attributes of the two relations, i.e., Ri ./cRj := {(ri, rj )|ri ∈ Ri ∧ rj ∈ Rj ∧ ri, rj satisfy c} A common type of join is natural join, which connects two relations by equating attributes of the same name, and projecting out one copy of each pair of equated attributes. This is denoted Ri ./ Rj (the condition c is implied.) The base operators defined above can handle extended relations and be combined with the new recommendation operators to be de- fined next but we have kept their original semantics. One could go further and define operators with extended semantics. For example, one could define that the select operator operates also on extended attributes. There is a rich literature on nested relational algebras that discuss extended, recursive operations such as nesting, unnesting and projections and joins on embedded relations [9, 11, 25]. We believe that such generality is not necessary for practical recommendations. Our framework can be easily extended to handle more generalized relational operators. 3.3 The Extend Operator With many (normalized) database schemas, information that conceptually refers to a single entity (e.g., a student’s course ratings) is often found in several relations. In our system, we want to represent such application entities with a single (extended) relation. For instance, a tuple in an extended relation could contain base information on a student (e.g., name, GPA), plus the set of courses a student has taken. For this purpose, we introduce a new operator that creates such extended attributes in the tuples of a relation. DEFINITION 2. Extend, ε, creates an extended relation by embedding a base relation Rj into another relation Ri, i.e., Ri ε Rj := {(ri, v)|ri ∈ Ri ∧ v := πA(ri ./ Rj )} where A is the set of attributes of Rj . In words, Ri ε Rj is an extended relation that contains all the attributes and tuples from Ri plus an extended attribute whose (extended) value for each tuple ri from Ri is a set of tuples from Rj that join to ri on the common attributes. If there are no joining tuples, then the extended attribute will be an empty relation for ri. The default name of this extended attribute is the name of Rj . In the above definition, only Rj is required to be a base relation because we allow one-level nesting. Ri can be a base or extended relation, since any extended relation can have more than one extended attribute. The extend operator can be defined for multiplelevel nesting as well. Example: Using the extend operator we can generate the extended relation Ext_Students described earlier as follows: PComments := π{SuID,CourseID,Rating,Date}(Comments) Ext_Students := π{SuID,Name}(Students) ε PComments In order to generate the extended relation Ext_Courses, we need to combine information found in more than two relations, as follows: Schedules := π{CourseID,Year,Term,InstrID}(CourseSched) Instruct_Sched := π{InstrID,Name}(Instructors) ./ Schedules Ext_Courses := π{CourseID,DepID,Title}(Courses) ε Instruct_Sched Note that joining over attributes with common names simplifies the presentation and the definition of the extend operator. It does
nationspaond antr retes thas can be used witp esis iye rat or e relations and attributes can be renamed accordingly. We can use Probability[A, B, RI(t, s)=oRA=LLAI(R)DAB OR A=sAJ(R)I a rename operator PIR(B1,.Bn)(R)that produces a relation JoR.A=s(Aj(R) identical to but with name R and attributes. in order. named This function computes the number of pairs of tuples from a re B1,... Bn. For example, we can rename COmments to Comments in lation R with the same value on an attribute B and with t(a as order to have the relation Ext Students as shown in Figure 3 the value of the attribute A in one of the pair's tuples and s(A Ext_Students: T(SMD, Name)(Students)E P[Comments(PComments) as the value in the other pair tuple divided by the number of tu- les in R with s a as the value of the attribute A. For example 3.4 The Recommend operator Probability[CourselD, SulD, StudentHistory(t, s)finds the similarity etween two courses based on the number of students (identified by Recommendations are based on comparisons(e.g rated by comparing their topics to a student 's interests, students are their SulD)that took both courses t and s(identified by their course mpared to a particular student based on their ratings in order to id CourselD)divided by the total number of students that took s find similar students, and so forth). In order to"build" recommen- for comparing two extended values, each one belonging to one of lations, we will have a library of c on les omparison functions that imple- the input tuples, using some metric, such as the Euclidean distance, ks of the form defined belo Pearson coefficient. etc. DEFINITION 3. A comparison function cf is a parameterized function that maps a tuple t to a single scalar value u by comparing it to another tuple s. Euclidean(E, A1,A2(t,s=2([]-jlA2) u=cfIP(t, s) i∈fE 135=1 where t and s are the two inputs to be compared and P denotes the parameters used in the function. This function compares two tuples t and s on their extended at- tribute E by summing up the squared differences of the values of For example, we may wish to apply a comparison function to attribute A, in all tuples i and j from the nested relations t(E ome part of two tuples, i.e to an attribute A of them; then, we and s[E], respectively, that join on Al. Then, we could, for in- would specify cf[](t, s). It could also be to a set of attributes for stance, use Euclidean(Comments, CourselD, RatingI(t, s)to compare multi-criteria recommendations [4]. We do not make any particular students based on their common ratings, where students t ands are ssumptions for the tuples t and s, e.g, that they should have the taken from Ext._Students ame number of attributes or that corresponding attributes should ave the same type, allowing in this way a wide spectrum of func- Comparisons of single values to extended values. We could defin tions to be defined, as the forthcoming examples will illum a comparison function that compares two values of different e. g, a single value to an extended value, i. e, to a relation E. For function that computes a relationship between its two inputs based example, we could define a function that tries to locate a single n either probabilistic approaches(e.g,[14 )or more traditional tem-to-item correlations. such as pearsons correlation Kendalls attribute B in the tuple where the value was found, i.e. tau, Euclidean distance, and Jaccard index, as well as user-defined Identify[A,E,B(t,s)=叫[Bif彐∈sEst.t4]=v[4 lomain-specific metrics. It could also be a prediction model learn For instance. if t is a course and s is a student from Ext Students offline(e.g, (81). We illustrate several possibilities below Identify[CourselD, Comments, Ratingl(t, s)returns the rating of Comparisons of string values. For example, we could consider dent s for course t based on the course id. In Section 3.6, we will a function that measures the Jaccard similarity between two string ee examples that illustrate how these functions can be used for values, each one belonging to one of the input tuples of the function generating recommendations and treated, for the purposes of comparison, as a bag of words, i.e Comparison functions compare one tuple to another tuple. It is Jaccard[4(t,s)=14】ns4 frequently desirable to compare one tuple to a set of tuples, e. g |t[4Us[4]l a student to a group of students. For this purpose, we define an Using this function, we could, for example, compare course de- aggregation comparison function scriptions using Jaccard[ DescriptionI(t, s), where t and s represent DEFINITION 4. An aggregation comparison function a is a pa- ourses, or compare instructors on the basis of their names with rameterized function that maps a tuple t to a single scalar value Jaccard[Namel(t, s), where t and s are two tuples in Instructors by comparing it to a set of tuples s Comparisons of numerical values. For example, we could use a iunction that measures the distance of two numbers. i.e.. u=aef, PI(t, S)=aPef(t, si)ls: E Sn) Essentially, a takes as parameter the comparison func Distance[A(t, s)=t[A-s[Al the tuple-to-tuple comparisons and generates a list of comparisor Such a function could be used to compare students based on their function values cf(t, s1),cf(t, sn), Vsi E S, which are then GPAs,i.e,Distance (GPAl(t, s), where t and s belong to Students, combined through a into a single value or to compare courses based on their units, i.e., Distance[ Units(t, s) An aggregation comparison function combines all partial values where t and s represent courses. into a final one using a function, such as the maximum, the average. Using conditional probabilities. An alternate way of computing and so forth, and essentially signifies the relationship of tuple t the similarity between two tuples is to use a measure that is based to the set of tuples s. P denotes other parameters to be used in of th the function. For example, we may wish to compute the weig that the other tuple has already been observed. For example, a sim- average of the partial comparison values using as weights the ple comparison function that computes conditional probability is of attribute A of the tuples in S
not impose any real constraints on its expressivity or on the relations and attributes that can be used with this operator because relations and attributes can be renamed accordingly. We can use a rename operator ρ[R 0 (B1, ...Bn)](R) that produces a relation identical to R but with name R 0 and attributes, in order, named B1, ... Bn. For example, we can rename PComments to Comments in order to have the relation Ext_Students as shown in Figure 3: Ext_Students := π{SuID,Name}(Students) ε ρ[Comments(PComments)] 3.4 The Recommend Operator Recommendations are based on comparisons (e.g., courses are rated by comparing their topics to a student’s interests, students are compared to a particular student based on their ratings in order to find similar students, and so forth). In order to “build” recommendations, we will have a library of comparison functions that implement common recommendation tasks of the form defined below. DEFINITION 3. A comparison function cf is a parameterized function that maps a tuple t to a single scalar value u by comparing it to another tuple s. u = cf[P](t, s) where t and s are the two inputs to be compared and P denotes the parameters used in the function. For example, we may wish to apply a comparison function to some part of two tuples, i.e., to an attribute A of them; then, we would specify cf[A](t, s). It could also be to a set of attributes for multi-criteria recommendations [4]. We do not make any particular assumptions for the tuples t and s, e.g., that they should have the same number of attributes or that corresponding attributes should have the same type, allowing in this way a wide spectrum of functions to be defined, as the forthcoming examples will illuminate. Also, we do not assume any properties for cf. It could be any function that computes a relationship between its two inputs based on either probabilistic approaches (e.g., [14]) or more traditional item-to-item correlations, such as Pearson’s correlation, Kendall’s tau, Euclidean distance, and Jaccard index, as well as user-defined, domain-specific metrics. It could also be a prediction model learnt offline (e.g., [8]). We illustrate several possibilities below. • Comparisons of string values. For example, we could consider a function that measures the Jaccard similarity between two string values, each one belonging to one of the input tuples of the function and treated, for the purposes of comparison, as a bag of words, i.e.: Jaccard[A](t, s) = |t[A] T s[A]| |t[A] S s[A]| Using this function, we could, for example, compare course descriptions using Jaccard[Description](t, s), where t and s represent courses, or compare instructors on the basis of their names with Jaccard[Name](t, s), where t and s are two tuples in Instructors. • Comparisons of numerical values. For example, we could use a function that measures the distance of two numbers, i.e.: Distance[A](t, s) = t[A] − s[A] Such a function could be used to compare students based on their GPAs, i.e., Distance[GPA](t, s), where t and s belong to Students, or to compare courses based on their units, i.e., Distance[Units](t, s), where t and s represent courses. • Using conditional probabilities. An alternate way of computing the similarity between two tuples is to use a measure that is based on the conditional probability of observing one of the tuples given that the other tuple has already been observed. For example, a simple comparison function that computes conditional probability is: Probability[A, B, R](t, s) = |σR.A=t[A] (R) ./B σR.A=s[A] (R)| |σR.A=s[A] (R)| This function computes the number of pairs of tuples from a relation R with the same value on an attribute B and with t[A] as the value of the attribute A in one of the pair’s tuples and s[A] as the value in the other pair tuple divided by the number of tuples in R with s[A] as the value of the attribute A. For example, Probability[CourseID, SuID, StudentHistory](t, s) finds the similarity between two courses based on the number of students (identified by their SuID) that took both courses t and s (identified by their course id CourseID) divided by the total number of students that took s. • Comparisons of extended values. We could also define functions for comparing two extended values, each one belonging to one of the input tuples, using some metric, such as the Euclidean distance, Pearson coefficient, etc. For example, a comparison function using the Euclidean distance is: Euclidean[E, A1, A2](t, s) = vuuuut X i∈t[E] j∈s[E] i[A1]=j[A1] (i[A2] − j[A2])2 This function compares two tuples t and s on their extended attribute E by summing up the squared differences of the values of attribute A2 in all tuples i and j from the nested relations t[E] and s[E], respectively, that join on A1. Then, we could, for instance, use Euclidean[Comments, CourseID, Rating](t, s) to compare students based on their common ratings, where students t and s are taken from Ext_Students. • Comparisons of single values to extended values. We could define a comparison function that compares two values of different types, e.g., a single value to an extended value, i.e., to a relation E. For example, we could define a function that tries to locate a single attribute value within E, and if it finds it, it returns the value of an attribute B in the tuple where the value was found, i.e.: Identify[A, E, B](t, s) = v[B] if ∃v ∈ s[E] s.t. t[A] = v[A] For instance, if t is a course and s is a student from Ext_Students, Identify[CourseID, Comments, Rating](t, s) returns the rating of student s for course t based on the course id. In Section 3.6, we will see examples that illustrate how these functions can be used for generating recommendations. Comparison functions compare one tuple to another tuple. It is frequently desirable to compare one tuple to a set of tuples, e.g., a student to a group of students. For this purpose, we define an aggregation comparison function. DEFINITION 4. An aggregation comparison function a is a parameterized function that maps a tuple t to a single scalar value u by comparing it to a set of tuples S: u = a[cf, P](t, S) = a[P]({cf(t, si)|si ∈ S}) Essentially, a takes as parameter the comparison function to use for the tuple-to-tuple comparisons and generates a list of comparison function values cf(t, s1), ...cf(t, sn), ∀si ∈ S, which are then combined through a into a single value. An aggregation comparison function combines all partial values into a final one using a function, such as the maximum, the average, and so forth, and essentially signifies the relationship of tuple t to the set of tuples S. P denotes other parameters to be used in the function. For example, we may wish to compute the weighted average of the partial comparison values using as weights the values of attribute A of the tuples in S:
an2 a course only with courses offered in previous terms. Another pos- sibility is a select operator that may follow in order to filter the out ut of a recommend operator, e. g, selecting only tuples with score bove a threshold. In these cases, we will use Dcf. a c as a short- hand. In what follows, for simplicity, we omit cf, a, and c from the operator notation whenever they are understood or not required (e.g, when there is no filtering), and we simply write RiDRj 3.5 The blend operator Often, we may want to combine recommendations generated (a) different recommendations through two different processing paths into one. For example, we FrendCourses5ReqCourses CourseID T幽see may have a set of courses that can be recommended based on the Advanc tudent history and courses suggested by friends or courses that are required for graduating, and we want to provide one recommenda- tion. For this purpose, we introduce the blend operator. put relations, Ri and Ri, augmented with an attribute with default name bscore, as follow R: BM R: =I(r,u) RUR AU: =MRi, r(r)) FriendCourses B Req Courses Tuples in the input relations should be union-compatible. In a sense, the blend operator is an advanced union, which does not only combine the tuples from its input relations but also augments each tuple with an attribute whose value is computed by taking into account the tuple in perspective with the tuples coming from 07 both relations using a blending method M. the name of the new attribute can be different from the default by renaming (c)normalized blending Example: Figure 4(a) shows two sets of courses, which have been generated in different ways: the set Req Courses is recommended FriendCourses 8 Req course based on degree requirements and the set Friend Courses is recom- mended by a students friends. A course in either set is described by the course code, title and recommendation score. The attributes may have been renamed before the relations are presented in the c1c778 blend operator using a rename operator. A blending method Occurrences (A, Ri, Ri counts the occur- rences of each tuple, identified by the attribute A, in both relations. Figure 4(b) shows the result of mixing the sets of courses using Figure 4: Blending examples Occurrences[CourselD, ReqCourses, Friend Courses]. For example, the Advanced Programming course is found in both sets in tuples ti The Occurrences blending method computes the same si[A]*cf(t, si) value for both tuples, i.e., Occurrences(t1)= Occurrences(ta) W_Avglcf, A(t, S) 2. This may not occur with other methods, like the next one ∑s[4 A Norm_Blend [B, wi, w,, Ri, Ril method computes a bscore for each tuple by normalizing the value of its attribute B and weight ing it using a real number in 0,1] ltioned earlier, (aggregation)comparison functions, like the e illustrate here, will belong to a library of functions to be Norm_Blend[B, w'i,wj, Ri, R,l()=BP *tift∈R1 *t;ift∈R DEFINITION 5. Recommend, Def, a, outputs the tuples of a re- For the two course sets, we could use Norm_Blend(score, 0.7.1 lation Ri augmented with an attribute with default name score, whose value for each tuple is computed by comparing this tuple the second set because satisfying the requirements is more impor with the tuples of a relation Ri, as follows tant. In the combined set of courses shown in Figure 4(c). the Ad- vanced Programming course now appears with different bscores R1 Def, a R:={(r,v)r;∈R4∧v:=al](r,B)} 0.7 = 0.7 in ti and a to*1= 1.0 in ta. On the other hand, the Graphics course is highly recommended by the students In other words, the score value of each tuple ri in Ri is produced friends but it has a lower score according to the requirements: a 9 in y comparing ri to each tuple in R, using function cf and then ts in contrast to a 7 in tb. However, due to giving different weight b aggregating using function a to the two sets, the bscore for ts is Io*0.7=0.63 and for tb is The new attribute can be renamed. if so desired. Furthermore sidered for each tuple in Ri. For example, we may want to compare for each tuple, identified by an attribute A in the two input relations
FriendCourses CourseID Title Score ReqCourses t1 t2 t3 ta tb tc C2 Advanced Programming 10 C5 AI Techniques 7 C10 Graphics 9 CourseID Title Score C2 Advanced Programming 10 C10 Graphics 7 C22 Compilers 8 (a) different recommendations CourseID Title Score C2 Advanced Programming 10 C5 AI Techniques 7 C10 Graphics 9 FriendCourses ß ReqCourses C10 Graphics 7 C22 Compilers 8 t1, ta t2 t3 tb tc Bscore 2 1 2 2 1 (b) occurrence-based blending CourseID Title Score C2 Advanced Programming 10 C5 AI Techniques 7 C10 Graphics 9 FriendCourses ß ReqCourses C2 Advanced Programming 10 C10 Graphics 7 C22 Compilers 8 t1 t2 t3 ta tb tc Bscore 0.7 0.49 0.63 1 0.7 0.8 (c) normalized blending CourseID Title Score C2 Advanced Programming 10 C5 AI Techniques 7 C10 Graphics 9 FriendCourses ß ReqCourses C10 Graphics 7 t1 t2 t3 ta, tb Bscore 10 2.88 7.82 7.82 tc C22 Compilers 8 4.7 (d) weighted average blending Figure 4: Blending examples W_Avg[cf, A](t, S) = X si ∈ S si[A] ∗ cf(t, si) X si ∈ S si[A] As mentioned earlier, (aggregation) comparison functions, like the ones we illustrate here, will belong to a library of functions to be used with the recommend operator. DEFINITION 5. Recommend, .cf,a, outputs the tuples of a relation Ri augmented with an attribute with default name score, whose value for each tuple is computed by comparing this tuple with the tuples of a relation Rj , as follows: Ri .cf,a Rj := {(ri, v)|ri ∈ Ri ∧ v := a[cf](ri, Rj )} In other words, the score value of each tuple ri in Ri is produced by comparing ri to each tuple in Rj using function cf and then aggregating using function a. The new attribute can be renamed, if so desired. Furthermore, we may specify a condition on which tuples from Rj will be considered for each tuple in Ri. For example, we may want to compare a course only with courses offered in previous terms. Another possibility is a select operator that may follow in order to filter the output of a recommend operator, e.g., selecting only tuples with score above a threshold. In these cases, we will use .cf,a,c as a shorthand. In what follows, for simplicity, we omit cf, a, and c from the operator notation whenever they are understood or not required (e.g., when there is no filtering), and we simply write Ri.Rj . 3.5 The Blend Operator Often, we may want to combine recommendations generated through two different processing paths into one. For example, we may have a set of courses that can be recommended based on the student history and courses suggested by friends or courses that are required for graduating, and we want to provide one recommendation. For this purpose, we introduce the blend operator. DEFINITION 6. Blend, βM, outputs the tuples from its two input relations, Ri and Rj , augmented with an attribute with default name bscore, as follows: Ri βM Rj := {(r, v)|r ∈ Ri ∪ Rj ∧ v := M[Ri, Rj ](r)} Tuples in the input relations should be union-compatible. In a sense, the blend operator is an advanced union, which does not only combine the tuples from its input relations but also augments each tuple with an attribute whose value is computed by taking into account the tuple in perspective with the tuples coming from both relations using a blending method M. The name of the new attribute can be different from the default by renaming. Example: Figure 4(a) shows two sets of courses, which have been generated in different ways: the set ReqCourses is recommended based on degree requirements and the set FriendCourses is recommended by a student’s friends. A course in either set is described by the course code, title and recommendation score. The attributes may have been renamed before the relations are presented in the blend operator using a rename operator. • A blending method Occurrences[A,Ri, Rj ] counts the occurrences of each tuple, identified by the attribute A, in both relations. Figure 4(b) shows the result of mixing the sets of courses using Occurrences[CourseID, ReqCourses, FriendCourses]. For example, the Advanced Programming course is found in both sets in tuples t1 and ta. The Occurrences blending method computes the same value for both tuples, i.e., Occurrences(t1) = Occurrences(ta) = 2. This may not occur with other methods, like the next one. • A Norm_Blend[B, wi, wj ,Ri, Rj ] method computes a bscore for each tuple by normalizing the value of its attribute B and weighting it using a real number in [0, 1]: Norm_Blend[B, wi, wj , Ri, Rj ](t) = ( t[B] max B ∗ wi if t ∈ Ri, t[B] max B ∗ wj if t ∈ Rj . For the two course sets, we could use Norm_Blend[Score, 0.7, 1, FriendCourses, ReqCourses], which gives more weight to the courses in the second set because satisfying the requirements is more important. In the combined set of courses shown in Figure 4(c), the Advanced Programming course now appears with different bscores: a 10 10 ∗ 0.7 = 0.7 in t1 and a 10 10 ∗ 1 = 1.0 in ta. On the other hand, the Graphics course is highly recommended by the student’s friends but it has a lower score according to the requirements: a 9 in t3 in contrast to a 7 in tb. However, due to giving different weight to the two sets, the bscore for t3 is 9 10 ∗ 0.7 = 0.63 and for tb is 7 10 ∗ 1.0 = 0.7. • A Wavg_Blend[A, B, wi, wj ,Ri, Rj ] method computes a bscore for each tuple, identified by an attribute A in the two input relations
as the weighted average of the tuple values in the attribute B that will help her improve her knowledge on these topics. There- 1*[B]+;*[B fore, she asks for suggestions on courses based on her course his- Wavg_Blend(A, B, wi, wj, Ri, Ril(t) tory. Course similarity will be determined based on the course de- scriptions. Each candidate course obtains the minimum Jaccard using Wavg_Blend(CourselD, Score, 0.7, 1, Friend Courses, similarity score when compared to the list of Alice's courses,i.e ve can take into account that the Advanced Program- we will use the recommend operator D Jaccard(DescriptionI ming course is recommended both based on the requirements and by friends and give it a high score 90.7+1. 00=10, as Figure Candidate Courses Year-200s( Cour 4(d) illustrates As in the case of comparison functions, blending methods will be part of an external library of possible blending methods, so that Example 2 is a workflow for content-based recommendations the application designer does not have to code them Example 3( Nearest-neighbor collaborative filtering ) Alice best friend is Joanne(with id 444)and Alice is curious to see what 3.6 Recommendation Workflows courses would be recommended to her friend by her colleagues The recommend and blend operators capture the essence of most tem allows friends to see each other's recommendations). We could commendation approaches and they can be composed and com bined with the more traditional select, project, join operators to use the inverse Euclidean (inv Euclidean) function, so that stu- describe recommendations dents with more similar ratings will have higher weight. These students could be found using inv_Euclidean(Comments, CourseID, Ratingl DEFINITION 7. A recommendation workflow(query) is a di- Ext_- Students: Students E T(Sull rected acyclic graph of interconnected operators that describes how This Student:= sulD=444(Ext_Students) recommendations are generated. Other_Students: oSulD<>444(Ext_ Students) Below, we present several examples of workflows that illumi- nate the new recommendation capabilities and the expressivity of could the the framework. These workflows describe recommendations that the ratings provided by these students. The recommend operator are hypothetically requested by a student, Alice. We will show ho applied uses the Identify comparison function to extract the user mmon recommendation strategies(found in literature)can be ex- ratings given to each course and aggregates them into one recom- essed as parameterized workflows with our operators, as well as mendation score per course using the W_Avg Score] aggregation illustrate some of the new recommendation possibilities. comparison function, where the scores of the students computed by Clearly, we do not expect that Alice or any user will describe de- the previous recommend operator are used to weigh their ratings sired recommendations at this level rather users will use appropri- ate user interfaces. For the sake of presentation, in each example This workfow captures nearest-neighbor collaborative filtering we first explain the kind of comparison expected and how this is We could use other functions in place of the inverse Euclidean, captured in the recommend operator, and then, in the workflows, such as the Pearson and also replace the W_Avg with e.g,the we use only the symbol of the recommend operator without any eighted average of rating deviations from the neighbors mean parameters since they will be understood in the context of each rating. Then, our workflow would represent the GroupLens collab- example. Also, we do not show some common sense operations, orative filtering approach [22 Note that it is easy to parameterize such as excluding from the set of recommended courses all courses our workflow(unlike other approaches), and hence it is possible to already taken by Alice, or selecting the courses with the highest generate recommendation for a person(Alice) assuming that they recommendation scores are intended for a different student(e. g Joanne) ample 1(Related courses ) Alice is browsing the page of a At this point, we observe that the recommend operator treats stu- course in the system with title"Programming: Part One"(and id dent ratings as a(virtual)attribute of students(namely, Comments ) C22). We would like to show other related courses that are also in the same way that it would do with base attributes, such as the offered in 2008. Assume that the comparison function used for description of courses in the previous example. However, in our string values is Jaccard. Then, similar courses could be found relational database, student ratings are stored separately from the using the recommend operator D Jaccard[Title as follows: rest of information regarding students, so we would have to list as This Course: = a inputs of the recommend operator multiple relations. The recom- mend operator would also have to know how to join these multiple Related Courses : Candidate Courses b This Course inputs and group ratings that refer to the same student making it elatedCourses contains courses with a score attribute that report cumbersome to define the operator. In order to enable our recom- the strength of the recommendation based on how similar a course mend operator to operate on attributes of entities transparently, i.e is to the one currently browsed by Alice. For example, it could con- irrespective of whether these are base attributes, such the GPA, or extended""attributes. such as a set of ratings we need the extend g: Part Two with score 0.5, and a course named"Advanced Programming method perator. In our example workflow, students are extended with an 0.2. This example demonstrates the use of attribute Comments that groups their course ratings, so that the set of ratings for each student can be "viewed as another attribute of the single recommend operator for comparing a set of tuples to a sin gle tuple. The next example shows how the recommend operator tudent in the workflow can be used for comparing a set of tuples to another set of tuples. We can also easily define novel types of recommendations. Examples with multiple recommend and blend operators follow the following examples illustrate Example 2(Content-based recommendations ) Alice(with Example 4(Many recommend and blend operators: Alice id 1234)has taken some courses on related topics(e. g, on litera- wants recommendations from students that are similar to her friend ture, writing, etc)and she is interested in courses offered this year as before but she also wants them to be as good as she is based
as the weighted average of the tuple values in the attribute B. Wavg_Blend[A, B, wi, wj , Ri, Rj ](t) = wi ∗ t[B] + wj ∗ t[B] wi + wj For example, using Wavg_Blend[CourseID, Score, 0.7, 1, FriendCourses, ReqCourses], we can take into account that the Advanced Programming course is recommended both based on the requirements and by friends and give it a high score 0.7∗10+1.0∗10 0.7+1.0 = 10, as Figure 4(d) illustrates. As in the case of comparison functions, blending methods will be part of an external library of possible blending methods, so that the application designer does not have to code them. 3.6 Recommendation Workflows The recommend and blend operators capture the essence of most recommendation approaches and they can be composed and combined with the more traditional select, project, join operators to describe recommendations. DEFINITION 7. A recommendation workflow (query) is a directed acyclic graph of interconnected operators that describes how recommendations are generated. Below, we present several examples of workflows that illuminate the new recommendation capabilities and the expressivity of the framework. These workflows describe recommendations that are hypothetically requested by a student, Alice. We will show how common recommendation strategies (found in literature) can be expressed as parameterized workflows with our operators, as well as illustrate some of the new recommendation possibilities. Clearly, we do not expect that Alice or any user will describe desired recommendations at this level rather users will use appropriate user interfaces. For the sake of presentation, in each example, we first explain the kind of comparison expected and how this is captured in the recommend operator, and then, in the workflows, we use only the symbol of the recommend operator without any parameters since they will be understood in the context of each example. Also, we do not show some common sense operations, such as excluding from the set of recommended courses all courses already taken by Alice, or selecting the courses with the highest recommendation scores. Example 1 (Related courses): Alice is browsing the page of a course in the system with title “Programming: Part One” (and id C22). We would like to show other related courses that are also offered in 2008. Assume that the comparison function used for string values is Jaccard. Then, similar courses could be found using the recommend operator .Jaccard[Title] as follows: ThisCourse := σCourseID=C22(Courses) CandidateCourses := σYear=2008(Courses) RelatedCourses := CandidateCourses . ThisCourse RelatedCourses contains courses with a score attribute that reports the strength of the recommendation based on how similar a course is to the one currently browsed by Alice. For example, it could contain a course named “Programming: Part Two” with a (Jaccard) score 0.5, and a course named “Advanced Programming Methodology” with a score 0.2. This example demonstrates the use of a single recommend operator for comparing a set of tuples to a single tuple. The next example shows how the recommend operator can be used for comparing a set of tuples to another set of tuples. Examples with multiple recommend and blend operators follow. Example 2 (Content-based recommendations): Alice (with id 1234) has taken some courses on related topics (e.g., on literature, writing, etc) and she is interested in courses offered this year that will help her improve her knowledge on these topics. Therefore, she asks for suggestions on courses based on her course history. Course similarity will be determined based on the course descriptions. Each candidate course obtains the minimum Jaccard similarity score when compared to the list of Alice’s courses, i.e., we will use the recommend operator .Jaccard[Description],min: AliceCourses := σSuID=1234(StudentHistory) ./ Courses CandidateCourses := σYear=2008(Courses) RecCourses := CandidateCourses . AliceCourses Example 2 is a workflow for content-based recommendations. Example 3 (Nearest-neighbor collaborative filtering): Alice’s best friend is Joanne (with id 444) and Alice is curious to see what courses would be recommended to her friend by her colleagues with similar tastes, i.e., with similar ratings as Joanne (say the system allows friends to see each other’s recommendations). We could use the inverse Euclidean (inv_Euclidean) function, so that students with more similar ratings will have higher weight. These students could be found using .inv_Euclidean[Comments,CourseID,Rating] : Ext_Students := Students ε π{SuID,CourseID,Rating}(Comments) ThisStudent := σSuID=444(Ext_Students) Other_Students := σSuID<>444(Ext_Students) RStds := Other_Students . ThisStudent Then, courses could be rated by taking the weighted average of the ratings provided by these students. The recommend operator applied uses the Identify comparison function to extract the user ratings given to each course and aggregates them into one recommendation score per course using the W_Avg[Score] aggregation comparison function, where the scores of the students computed by the previous recommend operator are used to weigh their ratings. Coll_Courses := Courses.Identify[CourseID,Comments,Rating],W_Avg[Score]RStds This workflow captures nearest-neighbor collaborative filtering. We could use other functions in place of the inverse Euclidean, such as the Pearson, and also replace the W_Avg with e.g., the weighted average of rating deviations from the neighbor’s mean rating. Then, our workflow would represent the GroupLens collaborative filtering approach [22]. Note that it is easy to parameterize our workflow (unlike other approaches), and hence it is possible to generate recommendation for a person (Alice) assuming that they are intended for a different student (e.g., Joanne). At this point, we observe that the recommend operator treats student ratings as a (virtual) attribute of students (namely, Comments), in the same way that it would do with base attributes, such as the description of courses in the previous example. However, in our relational database, student ratings are stored separately from the rest of information regarding students, so we would have to list as inputs of the recommend operator multiple relations. The recommend operator would also have to know how to join these multiple inputs and group ratings that refer to the same student making it cumbersome to define the operator. In order to enable our recommend operator to operate on attributes of entities transparently, i.e., irrespective of whether these are base attributes, such the GPA, or “extended” attributes, such as a set of ratings, we need the extend operator. In our example workflow, students are extended with an attribute Comments that groups their course ratings, so that the set of ratings for each student can be “viewed” as another attribute of the student in the workflow. We can also easily define novel types of recommendations, as the following examples illustrate. Example 4 (Many recommend and blend operators): Alice wants recommendations from students that are similar to her friend as before but she also wants them to be as good as she is based on
Dinw Euclidean(comments, Coursed, Rating Course Comments Figure 5: Recommendation workflow GPA. So, continuing the previous example from the Note how we can parameterize the workflow and let the selec- have been found we also want students that have si with tion conditions being determined at run time so that the workflow Alice. As a measure of student similarity, we can inverse compares any student to any group of students distance of their GPAs(inv Distance), as follows: Example 7(Recommending a major): We can easily devise Alice Student asulD-1234(Students) a new workflow that recommends a major to Alice based on her RStds2 : Other Students Pinv_Distance[GPA) Alice Student course history and so far performance(in fact, we are currently in- we can compute an overall score for students taking into tegrating recommendations for majors into our system. )We first how similar they are to both Joanne and Alice using the need to find students that have already selected a major and com- ng method Wavg_Blend(SulD, Score, L, 1, RStds, RStds2 pare their course selections and performance to Alices. We will QStds: RStds Bwavg_ Blend RStds2 Finally, course recommendations can be obtained in the sam MajStudents : T(SuD)( or"(StudentStudies Dd Study Programs)) way as when computing Coll_Courses before, with the only difference Ext_ MajStudents: MajStudents E T(SuD, CouseD, Grade)(StudentHistory) lat bscores will play the role of weig of the AliceStudent =aSulD-1234(Students)E T(SuD, CourslD, Grade)(StudentHistory) weighted average of student ratings Then, we will rate majors following a voting scheme. We will sum up the scores of the students that have followed each major. Consequently, this example query involves a total of three rec- RecMajors : Study Programs IDentify Study Prg/D, Study PrgID, Scovel, sum RStds ommend operators plus a blend operator. Finally, let us see recommendations from a different domain Example 5( Blending recommendations ) Assume we want to blend course recommendations based on Alice's course history Erample8(Item-to-item movie recommendations ):Alice decided to visit her favorite online movie site. Flex MDB. and down- (RecCourses)with those from similar students( Coll_Courses)giving load a movie. The movie site keeps the following database more weight to the latter in order to increase diversity in the final recommendations. We could, for example, use a blending method Wavg_Blend( CourselD, Score, 0.7, 1, Rec Courses, ColL_Courses Say FlexMDB has adopted FlexRecs and offers various customiz- Mixed Courses: RecCourses Bwavg_Blend Coll_Courses able recommendations. Among them, it offers recommendations dopting the Amazon item-to-item collaborative filtering approach Using our framework, one can define workflows that go beyond described in (17). At a high-level, item-to-item collaborative filter course recommendations, as the upcoming examples show. ing consists of two distinct components: the first component builds Example 6( Classification ) Honors students achieve high grades a model that captures the relations between the different items, and their course work and are often rewarded for their achievements the second component applies this pre-computed model to derive The university wants to keep track of any student with an excep- the recommendations for an active user[23]. The offline part builds tional performance. We would like to notify the program adminis- a similar-items table by finding items that customers tend to pur- trators of students that may qualify as honor students. To see if Al hase together, as follows: ice qualifies, we compare her to the honors students(having statu MoviePairs T(Mid, Mid2)(Saw bovid P(Saw(Vid, Mid2, Rating2)(Saw)) H in our data) based on the courses taken and grades earned. We could use again the inverse Euclidean(inv Euclidean) to com- Ext_Movie2 Movie Pairs E P(Saw(Vid, Mid2, Rating)](Saw) pute student-to-student similarity and take Alice's average similar MoviePairs keeps the pairs of movies seen togethe ity to all honors students as a score for her. Hence we will use the relation Ext_Moviel(Mid, Saw(Vid, Rating), Mid2) keeps all the ratings iven by viewers for the first movie of a pair and Ext_Movie keeps ,"H"(Students)E(SulD, CourseD, Grade)( StudentHistory) the ratings for the second movie of a pair. To compute similar lice Student : aSulD-1234(Students)E T(SulD, CourselD, Grade)(StudentHistory) mpare Ext_Moviel to Ext_Movie2 using the operator Dinv_ Euclidean(Saw, Vid, Rating). ( Mid, Mid2), which only
s Year=2008 Courses jaccard[Description],min Students e p {StudID, CourseID, Rating} Comments sStudID <> 444 s StudID = 444 Courses inv_Euclidean[Comments, CourseID, Rating] Identify[CourseID, Comments, Rating], W_Avg[Score] ßWavg_Blend Example 5 Example 2 Example 3 Courses StudentHistory s StudID=1234 Figure 5: Recommendation workflows GPA. So, continuing the previous example from the point that RStds have been found, we also want students that have similar GPA with Alice. As a measure of student similarity, we can take the inverse distance of their GPAs (inv_Distance), as follows: AliceStudent := σSuID=1234(Students) RStds2 := OtherStudents .inv_Distance[GPA] AliceStudent Then, we can compute an overall score for students taking into account how similar they are to both Joanne and Alice using the blending method Wavg_Blend[SuID, Score, 1, 1, RStds, RStds2]: QStds := RStds βWavg_Blend RStds2 Finally, course recommendations can be obtained in the same way as when computing Coll_Courses before, with the only difference that bscores will play the role of weights in the computation of the weighted average of student ratings: StricterRecs := Courses.Identify[CourseID,Comments,Rating],W_Avg[Bscore]QStds Consequently, this example query involves a total of three recommend operators plus a blend operator. Example 5 (Blending recommendations): Assume we want to blend course recommendations based on Alice’s course history (RecCourses) with those from similar students (Coll_Courses) giving more weight to the latter in order to increase diversity in the final recommendations. We could, for example, use a blending method Wavg_Blend[CourseID, Score, 0.7, 1, RecCourses, Coll_Courses]: MixedCourses := RecCourses βWavg_Blend Coll_Courses Using our framework, one can define workflows that go beyond course recommendations, as the upcoming examples show. Example 6 (Classification): Honors students achieve high grades in their course work and are often rewarded for their achievements. The university wants to keep track of any student with an exceptional performance. We would like to notify the program administrators of students that may qualify as honor students. To see if Alice qualifies, we compare her to the honors students (having status ’H’ in our data) based on the courses taken and grades earned. We could use again the inverse Euclidean (inv_Euclidean) to compute student-to-student similarity and take Alice’s average similarity to all honors students as a score for her. Hence, we will use the recommend operator .inv_Euclidean[StudentHistory,CourseID,Grade],Avg: HStudents := σstatus=“H”(Students) ε π{SuID,CourseID,Grade}(StudentHistory) AliceStudent := σSuID=1234(Students) ε π{SuID,CourseID,Grade}(StudentHistory) IsHStudent := AliceStudent . HStudents Note how we can parameterize the workflow and let the selection conditions being determined at run time so that the workflow compares any student to any group of students. Example 7 (Recommending a major ): We can easily devise a new workflow that recommends a major to Alice based on her course history and so far performance (in fact, we are currently integrating recommendations for majors into our system.) We first need to find students that have already selected a major and compare their course selections and performance to Alice’s. We will use .inv_Euclidean[StudentHistory,CourseID,Grade] . MajStudents := π{SuID}(σClassification=“major”(StudentStudies ./ StudyPrograms)) Ext_MajStudents := MajStudents ε π{SuID,CourseID,Grade}(StudentHistory) AliceStudent := σSuID=1234(Students) ε π{SuID,CourseID,Grade}(StudentHistory) RStds := Ext_MajStudents . AliceStudent Then, we will rate majors following a voting scheme. We will sum up the scores of the students that have followed each major. RecMajors := StudyPrograms .Identify[StudyPrgID,StudyPrgID,Score],sum RStds Finally, let us see recommendations from a different domain. Example 8 (Item-to-item movie recommendations): Alice decided to visit her favorite online movie site, FlexMDB, and download a movie. The movie site keeps the following database: Movies(Mid, Title, Year), Viewer(Vid, Name, Age), Saw(Vid, Mid, Rating) Say FlexMDB has adopted FlexRecs and offers various customizable recommendations. Among them, it offers recommendations adopting the Amazon item-to-item collaborative filtering approach described in [17]. At a high-level, item-to-item collaborative filtering consists of two distinct components: the first component builds a model that captures the relations between the different items, and the second component applies this pre-computed model to derive the recommendations for an active user [23]. The offline part builds a similar-items table by finding items that customers tend to purchase together, as follows: MoviePairs := π{Mid,Mid2}(Saw ./Vid ρ[Saw(Vid, Mid2, Rating2)](Saw) ) Ext_Movie1 := MoviePairs ε Saw Ext_Movie2 := MoviePairs ε ρ[Saw(Vid, Mid2, Rating)](Saw) MoviePairs keeps the pairs of movies seen together. The extended relation Ext_Movie1(Mid, Saw(Vid, Rating), Mid2)) keeps all the ratings given by viewers for the first movie of a pair and Ext_Movie2 keeps the ratings for the second movie of a pair. To compute similar movies, we compare Ext_Movie1 to Ext_Movie2 using the operator .inv_Euclidean[Saw,Vid,Rating],{Mid,Mid2}, which ensures that only paired
novies are compared, i.e., tuples from the two relations are com- pared if they have the same [ Mid, Mid2) Model T(Mid, Mid2, Score(Ext_Moviel D Ext- Movie2) Given the relation Model, the algorithm finds movies similar to those rated by the user, aggregates those items (we will sur recommend the most correlated items lice movies RecMovies: Alice Movies Score sum AliceMovies has all the movies that Alice has rated(identified by Mid )and movies similar to them(identified by Mid2). Recommended movies. i.e. RecMovies are selected from the similar movies. Figure 5 shows possible expression trees representing some of the example workflows above With FlexRecs, it is easy to define novel recommendations, since the new operators can be composed and combined with more tra- ditional operators to form recommendation workflows. To execute recommendation workflows, algebraic laws will help an optimizer to enumerate or transform plans in search of efficient ones for gen erating the desired recommendations. Figure 6: System Architecture Under the general operator semantics we assume, we cannot fined workflows(through an appropriate user interface)with differ make any assumptions for the general properties of all the oper- ent inputs and receive customized recommendations. The Work ators. For example, we cannot make any assumption w.r.t. the flow Manager hides the details of how flexible recommendation associativity or commutativity of any blend operator because its are generated, such as the details of the algorithms and structures ies depend on the semantics assumed in any given imple- used to implement the functionality of the operators that comprise on and, in particular, on the blending method used. For a recommendation workflow and the execution order of operators, xample, the bscores of tuples may depend on the order in which which can be different from the workflow definition in order to op- the input sets of tuples are presented to the operator(e.g, the timize the process ator may give more weight to the tuples of its first input relation) The Workfow Parser takes as input a workflow and constructs an Of course, in any particular realization of the framework, one may expression tree. like these shown in Figure 5, keeping the order of ive more specific or restricted semantics to some operator and be ble to define additional rules. For example, a blend operator based the operators as defined in the workflow, since no real processing on an Occurrences method will be always commutative. On the is performed at this level other hand, the extend and recommend operators are neither com The Recommendation Plan Generator takes as input an expres- mutative nor associative based on their definition There are laws sion tree and generates a recommendation execution plan. The gen- that can be derived directly from the definition of the operators. erated plan comprises a sequence of SQL statements and function For example, selection can be pushed through an extend to its first calls and specifies how and in what order operators are executed argument:a(RiE R,)=a(RiER and any results that are shared in a particular workflow and need to be materialized. This module leverages the underlying DB en- gine's query optimizer for implementing the new operators, and 4. SYSTEM ARCHITECTURE in particular, its ability to efficiently compute joins and aggrega- tions. In addition, it allows calling functions that implement various In this section, we present a prototype system that compiles and comparison functions and blending methods. When possible, these on top of an existing database system has a number of advantage functions are compiled into the SQL statements; in other cases ex- ternal functions can be called by the SQL statements. In this way, such as implementation ease, portability and faster deployment and the system can support new types of recommendations by extend. testing of the framework in our course planning tool. It also of- fers a proof of the framework's feasibility. Extending the database ing its library. In the next section, we describe in more detail the query engine with the processing capabilities required for support recommendation plan generation proces g flexible recommendations is the next step in order to try more The Recommendation Generator executes a plan and returns the sophisticated optimizations. A high level representation of the ngine, as- chitecture is shown in Figure 6. sembles any intermediate results and invokes the functions used in Our strategy for processing workflows reduces the amount of extended relations are never materialized. On the other hand,re- 4.1 Recommendation plan generati ults generated by any recommend or blend operators, which are The system builds a recommendation plan by traversing an ex later re-used in the same workflow are materialized for efficiency. pression tree from its root and going down to its leaves. To illustrate Our workflows are converted into sequences of SQL queries, whic the recommendation plan generation process and the implementa- are then executed by the database system. Thus, we can take advan tion of the new operators, we walk through the following examples tage of the underlying query opt uumizato n. The FlexRecs engine Example 3(cont'd): We return to Example 3(Sec. 3.6)and has the following parts consider its expression tree(Figure 5), which has two recommend The Workflow Manager allows a designer to define different rec- operators. We will first see the set of queries that can be generated ommendation workflows and end-users to invoke any of the de- for implementing the lower operator
movies are compared, i.e., tuples from the two relations are compared if they have the same {Mid, Mid2}. Model := π{Mid,Mid2,Score}(Ext_Movie1 . Ext_Movie2) Given the relation Model, the algorithm finds movies similar to those rated by the user, aggregates those items (we will sum up the similarity scores of the movies to the user’s movies) in order to recommend the most correlated items. AliceMovies := σVid=3232Saw ./Mid Model RecMovies := AliceMovies .Score,Sum,Mid2 AliceMovies AliceMovies has all the movies that Alice has rated (identified by Mid) and movies similar to them (identified by Mid2). Recommended movies, i.e., RecMovies, are selected from the similar movies. Figure 5 shows possible expression trees representing some of the example workflows above. With FlexRecs, it is easy to define novel recommendations, since the new operators can be composed and combined with more traditional operators to form recommendation workflows. To execute recommendation workflows, algebraic laws will help an optimizer to enumerate or transform plans in search of efficient ones for generating the desired recommendations. Under the general operator semantics we assume, we cannot make any assumptions for the general properties of all the operators. For example, we cannot make any assumption w.r.t. the associativity or commutativity of any blend operator because its properties depend on the semantics assumed in any given implementation and, in particular, on the blending method used. For example, the bscores of tuples may depend on the order in which the input sets of tuples are presented to the operator (e.g., the operator may give more weight to the tuples of its first input relation.) Of course, in any particular realization of the framework, one may give more specific or restricted semantics to some operator and be able to define additional rules. For example, a blend operator based on an Occurrences method will be always commutative. On the other hand, the extend and recommend operators are neither commutative nor associative based on their definition. There are laws that can be derived directly from the definition of the operators. For example, selection can be pushed through an extend to its first argument: σc(Ri ε Rj ) ≡ σc(Ri) ε Rj . 4. SYSTEM ARCHITECTURE In this section, we present a prototype system that compiles and executes FlexRecs workflows on top of a database. Implementation on top of an existing database system has a number of advantages, such as implementation ease, portability and faster deployment and testing of the framework in our course planning tool. It also offers a proof of the framework’s feasibility. Extending the database query engine with the processing capabilities required for supporting flexible recommendations is the next step in order to try more sophisticated optimizations. A high level representation of the architecture is shown in Figure 6. Our strategy for processing workflows reduces the amount of data saved in temporary relations during execution. For instance, extended relations are never materialized. On the other hand, results generated by any recommend or blend operators, which are later re-used in the same workflow are materialized for efficiency. Our workflows are converted into sequences of SQL queries, which are then executed by the database system. Thus, we can take advantage of the underlying query optimization. The FlexRecs engine has the following parts. The Workflow Manager allows a designer to define different recommendation workflows and end-users to invoke any of the deFunctions Data DB Engine Recommendation Generator e Methods ß Recommendation Plan Generator FlexRecs Workflow Parser Flexible recommendation engine Application Database Workflow Manager designer define workflows User Interface invoke workflow Users Figure 6: System Architecture fined workflows (through an appropriate user interface) with different inputs and receive customized recommendations. The Work- flow Manager hides the details of how flexible recommendations are generated, such as the details of the algorithms and structures used to implement the functionality of the operators that comprise a recommendation workflow and the execution order of operators, which can be different from the workflow definition in order to optimize the process. The Workflow Parser takes as input a workflow and constructs an expression tree, like these shown in Figure 5, keeping the order of the operators as defined in the workflow, since no real processing is performed at this level. The Recommendation Plan Generator takes as input an expression tree and generates a recommendation execution plan. The generated plan comprises a sequence of SQL statements and function calls and specifies how and in what order operators are executed and any results that are shared in a particular workflow and need to be materialized. This module leverages the underlying DB engine’s query optimizer for implementing the new operators, and in particular, its ability to efficiently compute joins and aggregations. In addition, it allows calling functions that implement various comparison functions and blending methods. When possible, these functions are compiled into the SQL statements; in other cases external functions can be called by the SQL statements. In this way, the system can support new types of recommendations by extending its library. In the next section, we describe in more detail the recommendation plan generation process. The Recommendation Generator executes a plan and returns the recommendations. It sends the SQL queries to the DB engine, assembles any intermediate results and invokes the functions used in the plan for comparing and blending tuples. 4.1 Recommendation Plan Generation The system builds a recommendation plan by traversing an expression tree from its root and going down to its leaves. To illustrate the recommendation plan generation process and the implementation of the new operators, we walk through the following examples. Example 3 (cont0 d): We return to Example 3 (Sec. 3.6) and we consider its expression tree (Figure 5), which has two recommend operators. We will first see the set of queries that can be generated for implementing the lower operator
Quer 1: Example 5(cont'd): We consider the expression tree corre- SELECT tt.SulD sponding to Example 5 in Figure 5. Recall that this tree corresponds FROM Comn to a workflow that combines recommendations based on a student's tt.CourselD-I2 Coursed AND L2. uD=444 AND t1 SulD e 444 history with course recommendations from other students. The rec- GROUP BY t1SulD ommendation plan for Example 3 is part of the plan for this exam- ple, with the only difference that the results of Query 3 in Figure 7 are materialized as an in-memory table(for the reasons explained CREATE TEMPORARY TABLE SELECT SUMscore'rating/SUM(score)As CScore earlier), and they are not ordered in this phase, since they are not the ⊥ FROM temp2, Couses final results yet. Let us that this materialized table is named SELECT t1. score WHERE temp2. CourselD-Courses CourseID OtherRecs and that a second materialized table, History Recs, hold FROM Comments t1, temp GROUP BY CouseD the recommendations based on the student history(corresponding to Example 2 in Figure 5). Query 4 in Figure 7 shows how the Query 4: blending method Wave_ Blend can be implemented with the help of SQL. Scores in Other Recs are given a greater weight than scores FROM SELECT Coursed, 0. 7CScore AS Score FROM History Recs UNIONALL in History Recs. Note that all queries are built on the fly based on (SELECT CourselD, 1.0"CScore AS Score FROM OeherRecs) the workflow. Hence, the weights are not hard-coded but they com- prise inputs of the blending method specified in the workflow. This query generates the final recommendations returned by the system. which may be additionally ordered for presentation purposes Figure 7: An example recommendation plan As we have seen in the examples above, for each recommend operator, the system builds a set of queries that implement the responding subtree contains select, project, and extend requested comparison function and group together any operators operators, which are all combined with the root recommend oper- that are found in the subtree under this recommend operator. We ator into one SQL query (Query 1 in Figure 7). This query has support an extensible library of (aggregation) comparison func- tions. Interestingly, a large number of functions, such as Pearson, several parts(shown shaded in Query 1), each one mapping to one Jaccard, Cosine, W_Avg, Avg can be impl ditions in the WhERE clause, (b)the recommend operator, which database(as e.g. in Query I in Figure 7. a q i mentedby com- the operators: (a)the select operators have been included as con ting and combining aggregations su y the underlying res students using the inverse Euclidean comparison functio ch cases, the part the recommendation plan that corresponds to the recommend oper on functions that are supported by the underlying database, (c)the atore mainly aggregate queries. The standard query oper extend operator is implemented by a GROUP BY clause. Observe ators, such as selections and projections, are mapped to appropriate how, in this example, the query does not join the relations Students SQL clauses, which are inserted into these queries Whenever there is an extend operator, no extended relation is ac- and Comments specified in the expression tree, because the sys em can recognize that all the attributes required by the extend and tually materialized in memory because that would require fetching recommend operators can be provided by the latter relation. This tribute, and executing a(possibly large) number of queries in order query creates a temporary in-memory table that contains two at- tributes for each student. the student id and a score to populate this attribute with joining tuples(e.g, courses)from tions based on the ratings provided by the similar users found in cessing, the joins implied by an extend operator are executed only The higher recommend operator rse recommenda- another relation. To save unnecessary 1/o operations and tuple pro- the previous step, ie it makes use of the materialized results of when tuples are actually requested by some upper in the expression the lower recommend operator. Queries 2 and 3 correspond to tree operator(typically a recommend operator) and the informa- this operator and use results generated by lower operators. Since tion related to a single entity is grouped together using SQL. For the result of the previous recommend operator has been generated instance, Query 1, which performs the necessary aggregations re- by aggregating the student ratings, we need again to associate stu- quired by the respective recommend operator, also realizes the ex- dents with their ratings in order to compute course recommenda- end operator that is in the subtree of the recommend operator as a tions combining student scores and ratings. Therefore, Query 2 In addition. as we have seen. if there are more than one recom- combines for each student the score and the ratings information mend or or if a recommend operator is followed by a blend one relation. Then, Query 3 contains a hidden extend operator in an expression tree, then a separate set of queries is built for each perator, which now uses a comparison function(Identify) of these operators. The results generated by the intermediate rec- ommend operators are materialized in order to avoid building com- and an aggregation comparison function(W_Avg), are again real- plicated, possibly nested, queries and reuse results of earlier com- ized by leveraging the database's existing aggregation capabilities The course recommendations may be ordered by their score for stations. The output of an intermediate recommend operator is presenting them to the user La. sedory relation with two attributes: a tuple id and a score Instead of having Query 2 generating an in y table and Blend operators are treated in a similar way as recommend ope tors. A set of queries is built that implement the requested blend equivalent,query.Splitting the computations in simpler queries and tions and projections, that are found in the subtree under a blend exploiting in-memory tables allows more efficient recommendation operator. We support an extensible set of blending methods. Again, we can leverage the expressivity of SQL joins and aggregations to The next example shows he the parts of the recommendation plan that refer to this operator. are materialized as in the case of the recommend operal on tree mented on top of a database For the sake of brevity, we only detail sults of blend operators that are internal nodes in an expressic
CREATE TEMPORARY TABLE temp SELECT t1.SuID, 1/SQRT(SUM((t1.Rating - t2.Rating) * (t1.Rating - t2.Rating))) as score FROM Comments t1, Comments t2 WHERE t1.CourseID = t2.CourseID AND t2.SuID = 444 AND t1.SuID <> 444 GROUP BY t1.SuID Query 1: CREATE TEMPORARY TABLE temp2 SELECT t1.*, score FROM Comments t1, temp WHERE t1.SuID = temp.SuID; Query 2: SELECT Courses.*, SUM(score*rating)/SUM(score) AS CScore FROM temp2, Courses WHERE temp2.CourseID=Courses.CourseID GROUP BY CourseID ORDER BY CScore Query 3: Query 4: SELECT Courses.*, SUM(Score )/1.7 AS BScore FROM ((SELECT CourseID, 0.7*CScore AS Score FROM HistoryRecs) UNION ALL (SELECT CourseID, 1.0*CScore AS Score FROM OtherRecs)) t1, Courses WHERE t1.CourseID=Courses.CourseID GROUP BY CourseID ORDER BY BScore Figure 7: An example recommendation plan The corresponding subtree contains select, project, and extend operators, which are all combined with the root recommend operator into one SQL query (Query 1 in Figure 7). This query has several parts (shown shaded in Query 1), each one mapping to one of the operators: (a) the select operators have been included as conditions in the WHERE clause, (b) the recommend operator, which compares students using the inverse Euclidean comparison function on their course ratings, is implemented by combining the aggregation functions that are supported by the underlying database, (c) the extend operator is implemented by a GROUP BY clause. Observe how, in this example, the query does not join the relations Students and Comments specified in the expression tree, because the system can recognize that all the attributes required by the extend and recommend operators can be provided by the latter relation. This query creates a temporary in-memory table that contains two attributes for each student: the student id and a score. The higher recommend operator computes course recommendations based on the ratings provided by the similar users found in the previous step, i.e., it makes use of the materialized results of the lower recommend operator. Queries 2 and 3 correspond to this operator and use results generated by lower operators. Since the result of the previous recommend operator has been generated by aggregating the student ratings, we need again to associate students with their ratings in order to compute course recommendations combining student scores and ratings. Therefore, Query 2 combines for each student the score and the ratings information into one relation. Then, Query 3 contains a hidden extend operator in its GROUP BY clause. The computations required by the recommend operator, which now uses a comparison function (Identify) and an aggregation comparison function (W_Avg), are again realized by leveraging the database’s existing aggregation capabilities. The course recommendations may be ordered by their score for presenting them to the user. Instead of having Query 2 generating an in-memory table and then Query 3 using this table, we could have generated a single, equivalent, query. Splitting the computations in simpler queries and exploiting in-memory tables allows more efficient recommendation processing over the DB engine that we use. The next example shows how the blend operator can be implemented on top of a database. For the sake of brevity, we only detail the parts of the recommendation plan that refer to this operator. Example 5 (cont0 d): We consider the expression tree corresponding to Example 5 in Figure 5. Recall that this tree corresponds to a workflow that combines recommendations based on a student’s history with course recommendations from other students. The recommendation plan for Example 3 is part of the plan for this example, with the only difference that the results of Query 3 in Figure 7 are materialized as an in-memory table (for the reasons explained earlier), and they are not ordered in this phase, since they are not the final results yet. Let us assume that this materialized table is named OtherRecs and that a second materialized table, HistoryRecs, holds the recommendations based on the student history (corresponding to Example 2 in Figure 5). Query 4 in Figure 7 shows how the blending method Wavg_Blend can be implemented with the help of SQL. Scores in OtherRecs are given a greater weight than scores in HistoryRecs. Note that all queries are built on the fly based on the workflow. Hence, the weights are not hard-coded but they comprise inputs of the blending method specified in the workflow. This query generates the final recommendations returned by the system, which may be additionally ordered for presentation purposes. As we have seen in the examples above, for each recommend operator, the system builds a set of queries that implement the requested comparison function and group together any operators that are found in the subtree under this recommend operator. We support an extensible library of (aggregation) comparison functions. Interestingly, a large number of functions, such as Pearson, Jaccard, Cosine, W_Avg, Avg can be implemented by computing and combining aggregations supported by the underlying database (as e.g., in Query 1 in Figure 7). In such cases, the part of the recommendation plan that corresponds to the recommend operator comprises mainly aggregate queries. The standard query operators, such as selections and projections, are mapped to appropriate SQL clauses, which are inserted into these queries. Whenever there is an extend operator, no extended relation is actually materialized in memory because that would require fetching tuples (e.g., students) in memory, augmenting them with a new attribute, and executing a (possibly large) number of queries in order to populate this attribute with joining tuples (e.g., courses) from another relation. To save unnecessary I/O operations and tuple processing, the joins implied by an extend operator are executed only when tuples are actually requested by some upper in the expression tree operator (typically a recommend operator) and the information related to a single entity is grouped together using SQL. For instance, Query 1, which performs the necessary aggregations required by the respective recommend operator, also realizes the extend operator that is in the subtree of the recommend operator as a GROUP BY clause. In addition, as we have seen, if there are more than one recommend operator or if a recommend operator is followed by a blend in an expression tree, then a separate set of queries is built for each of these operators. The results generated by the intermediate recommend operators are materialized in order to avoid building complicated, possibly nested, queries and reuse results of earlier computations. The output of an intermediate recommend operator is an in-memory relation with two attributes: a tuple id and a score. Blend operators are treated in a similar way as recommend operators. A set of queries is built that implement the requested blending method and also combine any standard operators, such selections and projections, that are found in the subtree under a blend operator. We support an extensible set of blending methods. Again, we can leverage the expressivity of SQL joins and aggregations to implement several methods, as illustrated in Query 4. Finally, results of blend operators that are internal nodes in an expression tree are materialized as in the case of the recommend operators