《电子商务 E-business》阅读文献：A flexible semantic inference methodology to reason about user preferences in knowledge-based recommender systems

团购合买资源类别：文库，文档格式：PDF，文档页数：16，文件大小：426.72KB

vailableonlineatwww.sciencedirect.com ScienceDirect Knowledge-Based SYSTEMS ELSEVIER Knowledge-Based Systems 21(2008)305-320 www.elsevier.com/locate/knosys a flexible semantic inference methodology to reason about user preferences in knowledge-based recommender systems Yolanda blanco- fernandez " Jose j. pazos-Arias Alberto Gil-Solla Manuel Ramos-Cabrer Martin LOpez-Nores, Jorge Garcia-Duque a Ana Fernandez-Vilas a, Rebeca P. Diaz-Redondo a, Jesus Bermejo-Munoz b ETSE Telecommicacion, Campus Universitario, Vigo 36310, Spain b Telent, Ronda de Tamarillo, 29, Sevilla 41006, spain Received 22 December 2006: received in revised form 18 June 2007: accepted 28 July 2007 Available online 9 August 2007 Abstract Recommender systems arose with the goal of helping users search in overloaded information domains(like e-commerce, e-learning or Digital TV). These tools automatically select items(commercial products, educational courses, TV programs, etc. )that may be appealing to each user taking into account his/her personal preferences. The personalization strategies used to compare these preferences with the available items suffer from well-known deficiencies that reduce the quality of the recommendations. Most of the limitations arise from using syntactic matching techniques because they miss a lot of useful knowledge during the recommendation process. In this paper, we propose a personalization strategy that overcomes these drawbacks by applying inference techniques borrowed from the Semantic Web Our approach reasons about the semantics of items and user preferences to discover complex associations between them. These semantic associations provide additional knowledge about the user preferences, and permit the recommender system to compare them with the available items in a more effective way. The proposed strategy is flexible enough to be applied in many recommender systems, regardless of their application domain. Here, we illustrate its use in AVATAR, a tool that selects appealing audiovisual programs from among the myriad available in Digital TV. C 2007 Elsevier B.v. All rights reserved Keywords: Recommender systems; Semantic Web: Ontologies; Semantic reasoning: Content-based filtering 1. Introduction overload is hard to digest for users. who have difficulties to find really relevant items(e.g. commercial products, educa n recent years, we have witnessed an exponential tional courses, TV programs). Recommender systems arose growth in the amount of information available in diverse in the middle 1990s to provide assistance in these searching domains(like e-commerce, e-learning or Digital TV). This tasks; to this aim, they automatically select those items that may be appealing to each user considering the prefe erences defined in his/her personal profile Work supported by the Spanish Ministry of Education and Science Projects TSl2004-03677 and TSl2007-61599, and by the Xunta de galicia The first recommendation strategies were the so-called project PGIDITO5PXI32204PN content-based methods[2, 44]. This technique suggests items g author.Tel:+34986813967;fax:+34986812100 similar to the ones that each user liked in the past. The sim- E-mail addresses: yolanda@det vigo. es (Y. Blanco-Fernandez), ilarity metrics employed limit the quality of the offered rec- se@det vigo.es (J.J. Pazos-Arias), agil@det. uvigo. HiI-Soll ommendations, because they are based on more or less os(adet. vigo. es (M. Ramos sophisticated matching techniques that miss a lot of knowl (A. Fernandez. vilas), rebecaadet uvigo es (R. P. Diaz-Redondo ), jesus. edge during the personalization process. In fact, due to bermejo@telventabengoa.com(J.Bermejo-Munoz). their syntactic nature, the existing metrics only detect 0950-7051/S-see front matter 2007 Elsevier B v. All rights reserved 6 knosys.2007.07.004

A flexible semantic inference methodology to reason about user preferences in knowledge-based recommender systems q Yolanda Blanco-Ferna´ndez a,*, Jose´ J. Pazos-Arias a , Alberto Gil-Solla a , Manuel Ramos-Cabrer a , Martı´n Lo´pez-Nores a , Jorge Garcı´a-Duque a , Ana Ferna´ndez-Vilas a , Rebeca P. Dı´az-Redondo a , Jesu´s Bermejo-Mun˜oz b a ETSE Telecomunicacio´n, Campus Universitario,Vigo 36310, Spain b Telvent, Ronda de Tamargillo, 29, Sevilla 41006, Spain Received 22 December 2006; received in revised form 18 June 2007; accepted 28 July 2007 Available online 9 August 2007 Abstract Recommender systems arose with the goal of helping users search in overloaded information domains (like e-commerce, e-learning or Digital TV). These tools automatically select items (commercial products, educational courses, TV programs, etc.) that may be appealing to each user taking into account his/her personal preferences. The personalization strategies used to compare these preferences with the available items suffer from well-known deficiencies that reduce the quality of the recommendations. Most of the limitations arise from using syntactic matching techniques because they miss a lot of useful knowledge during the recommendation process. In this paper, we propose a personalization strategy that overcomes these drawbacks by applying inference techniques borrowed from the Semantic Web. Our approach reasons about the semantics of items and user preferences to discover complex associations between them. These semantic associations provide additional knowledge about the user preferences, and permit the recommender system to compare them with the available items in a more effective way. The proposed strategy is flexible enough to be applied in many recommender systems, regardless of their application domain. Here, we illustrate its use in AVATAR, a tool that selects appealing audiovisual programs from among the myriad available in Digital TV. 2007 Elsevier B.V. All rights reserved. Keywords: Recommender systems; Semantic Web; Ontologies; Semantic reasoning; Content-based filtering 1. Introduction In recent years, we have witnessed an exponential growth in the amount of information available in diverse domains (like e-commerce, e-learning or Digital TV). This overload is hard to digest for users, who have difficulties to find really relevant items (e.g. commercial products, educational courses, TV programs). Recommender systems arose in the middle 1990s to provide assistance in these searching tasks; to this aim, they automatically select those items that may be appealing to each user considering the preferences defined in his/her personal profile. The first recommendation strategies were the so-called content-based methods [2,44]. This technique suggests items similar to the ones that each user liked in the past. The similarity metrics employed limit the quality of the offered recommendations, because they are based on more or less sophisticated matching techniques that miss a lot of knowledge during the personalization process. In fact, due to their syntactic nature, the existing metrics only detect 0950-7051/$ - see front matter 2007 Elsevier B.V. All rights reserved. doi:10.1016/j.knosys.2007.07.004 q Work supported by the Spanish Ministry of Education and Science Projects TSI2004-03677 and TSI2007-61599, and by the Xunta de Galicia project PGIDIT05PXI32204PN. * Corresponding author. Tel.: +34 986813967; fax: +34 986812100. E-mail addresses: yolanda@det.uvigo.es (Y. Blanco-Ferna´ndez), jose@det.uvigo.es (J.J. Pazos-Arias), agil@det.uvigo.es (A. Gil-Solla), mramos@det.uvigo.es (M. Ramos-Cabrer), mlnores@det.uvigo.es (M. Lo´pez-Nores), jgd@det.uvigo.es (J. Garcı´a-Duque), avilas@det.uvigo.es (A. Ferna´ndez-Vilas), rebeca@det.uvigo.es (R.P. Dı´az-Redondo), jesus. bermejo@telvent.abengoa.com (J. Bermejo-Mun˜oz). www.elsevier.com/locate/knosys Available online at www.sciencedirect.com Knowledge-Based Systems 21 (2008) 305–320

Y. Blanco-Fernandez et al. Knowledge- Based Systems 21(2008)305-320 similarity between items that share the same attributes or In order to discover as much knowledge about the user features. This causes overspecialized recommendations that preferences as possible, our approach looks for relation only include items very similar to those the user already ships hidden in the system ontology. Even though such knows relationships are already defined in the Semantic Web com- Instead of fighting the overspecialization of content- munity, they have never been included in a pe pased methods, researchers proposed new personalization environment like a recommender system. For that reason, strategies, such as collaborative filtering [21, 19, 37] and we develop a new inference methodology to take advantage hybrid approaches mixing both techniques [4, 9, 40]. To of the benefits of semantic reasoning in our recommenda diversify the offered recommendations, collaborative rec- tion strategy. The main features of this methodology are ommender systems suggest to each user items that were the following ones appealing to other users with similar tastes. Regardless of its success in many application domains, collaborative fil-. Firstly, it explores extensively the knowledge base and tering has two serious drawbacks. On the one hand, it discovers the hidden relationships from the entities(clas requires knowing many user profiles in order to elaborate ses and their instances), properties and hierarchical links ccurate recommendations for a given user. On the other explicitly formalized in it. Such relationships provide the s applicability and quality are limited by the so-called knowledge missed by the current approaches and permit sparsity problem, which occurs when the available data to compare the user preferences with the available items are insufficient for identifying similar users[9]. in a more effective way, beyond the traditional syntactic In order to develop an approach that provides high similarity metrics quality recommendations(even when data are sparse), this Secondly, our reasoning methodology ensures that the paper rescues the content-based methods by developing inferred relationships are relevant for the user, and that effective mechanisms against overspecialization. Specifi- they adapt in a flexible way as his/her preferences evolve cally, our approach overcomes this problem due to the over time. Thereby, our content-based strategy guaran- syntactic nature of the existing similarity metrics- by bor tees the diversity of the recommendations without risk rowing inference techniques from the Semantic Web. This ing their personalized nature initiative is based on describing the Web resources by meta-. Finally, our methodology is not exclusively joined to a data, so that computers can understand their semantics and specific domain, as it is flexible enough to be used in infer relationships between them. Such a reasoning pro diverse personalization applications. In this paper, we cess requires formalizing the semantic descriptions of the illustrate its use in AVATAR, a tool that selects appeal resources in a commonly agreed and reusable way; for that ing audiovisual programs from among the myriad avail purpose, the Semantic Web community resorts to ontolo able in Digital TV. gies, i.e. conceptualizations that identify typical concepts and relationships in a specific application domain. Con This paper is organized as follows: Section 2 reviews the cepts are identified by means of classes and relationships most well-known matching techniques used in traditional are represented as properties(both hierarchically orga- content-based approaches, as well as the proposals defined ties are included in this formal knowledge base. 3 proper- nized), and besides, specific instances of classes and in the Semantic Web for querying ontologies and inferring knowledge from them. Section 3 describes the two key ele Bearing in mind the results achieved in the Semantic ments of our semantic reasoning methodology, applied in Web, our proposal includes reasoning about the semantic the AVATAR system: the ontology about the tv domain descriptions of the items available in the recommender sys- and the user profiles that store their personal preferences tem(formalized in a domain ontology), and inferring The semantic relationships adopted in our approach and implicit semantic relationships between them. Such rela- the inference mechanism from the system knowledge base tionships diversify the recommendations because they are detailed in Sections 4 and 5, respectively. Section 6 allow establishing correspondences between the user pref- shows an example of the reasoning-based recommenda erences and other items appealing to him/her that do not tions elaborated by AVATAR. Section 7 compares our necessarily share the same features. In other words, our proposal with other similar approaches defined in the liter content-based strategy suggests items semantically related ature. Finally, Section 8 draws some conclusions and to those the user liked in the past, not just items similar describes our future work. to his/her preferences 2. Background I As the amount of products available in the collaborative recommender 2.1. Conventional matching techniques system increases, it is less likely that two users rate the same products in their profiles. Since there is no overlapping among their profiles, it is hard to find users with similar tastes As we mentioned in the previous section, content-based 2 We take reasoning as a synonym for inference recommender systems resort to matching techniques to The terms ontology and knowledge base are used without distinction. compare the user preferences to the available items. The

similarity between items that share the same attributes or features. This causes overspecialized recommendations that only include items very similar to those the user already knows. Instead of fighting the overspecialization of contentbased methods, researchers proposed new personalization strategies, such as collaborative filtering [21,19,37] and hybrid approaches mixing both techniques [4,9,40]. To diversify the offered recommendations, collaborative recommender systems suggest to each user items that were appealing to other users with similar tastes. Regardless of its success in many application domains, collaborative filtering has two serious drawbacks. On the one hand, it requires knowing many user profiles in order to elaborate accurate recommendations for a given user. On the other, its applicability and quality are limited by the so-called sparsity problem, which occurs when the available data are insufficient for identifying similar users1 [9]. In order to develop an approach that provides highquality recommendations (even when data are sparse), this paper rescues the content-based methods by developing effective mechanisms against overspecialization. Specifi- cally, our approach overcomes this problem – due to the syntactic nature of the existing similarity metrics – by borrowing inference techniques from the Semantic Web. This initiative is based on describing the Web resources by metadata, so that computers can understand their semantics and infer relationships between them. Such a reasoning2 process requires formalizing the semantic descriptions of the resources in a commonly agreed and reusable way; for that purpose, the Semantic Web community resorts to ontologies, i.e. conceptualizations that identify typical concepts and relationships in a specific application domain. Concepts are identified by means of classes and relationships are represented as properties (both hierarchically organized), and besides, specific instances of classes and properties are included in this formal knowledge base.3 Bearing in mind the results achieved in the Semantic Web, our proposal includes reasoning about the semantic descriptions of the items available in the recommender system (formalized in a domain ontology), and inferring implicit semantic relationships between them. Such relationships diversify the recommendations because they allow establishing correspondences between the user preferences and other items appealing to him/her that do not necessarily share the same features. In other words, our content-based strategy suggests items semantically related to those the user liked in the past, not just items similar to his/her preferences. In order to discover as much knowledge about the user preferences as possible, our approach looks for relationships hidden in the system ontology. Even though such relationships are already defined in the Semantic Web community, they have never been included in a personalization environment like a recommender system. For that reason, we develop a new inference methodology to take advantage of the benefits of semantic reasoning in our recommendation strategy. The main features of this methodology are the following ones: • Firstly, it explores extensively the knowledge base and discovers the hidden relationships from the entities (classes and their instances), properties and hierarchical links explicitly formalized in it. Such relationships provide the knowledge missed by the current approaches and permit to compare the user preferences with the available items in a more effective way, beyond the traditional syntactic similarity metrics. • Secondly, our reasoning methodology ensures that the inferred relationships are relevant for the user, and that they adapt in a flexible way as his/her preferences evolve over time. Thereby, our content-based strategy guarantees the diversity of the recommendations without risking their personalized nature. • Finally, our methodology is not exclusively joined to a specific domain, as it is flexible enough to be used in diverse personalization applications. In this paper, we illustrate its use in AVATAR, a tool that selects appealing audiovisual programs from among the myriad available in Digital TV. This paper is organized as follows: Section 2 reviews the most well-known matching techniques used in traditional content-based approaches, as well as the proposals defined in the Semantic Web for querying ontologies and inferring knowledge from them. Section 3 describes the two key elements of our semantic reasoning methodology, applied in the AVATAR system: the ontology about the TV domain and the user profiles that store their personal preferences. The semantic relationships adopted in our approach and the inference mechanism from the system knowledge base are detailed in Sections 4 and 5, respectively. Section 6 shows an example of the reasoning-based recommendations elaborated by AVATAR. Section 7 compares our proposal with other similar approaches defined in the literature. Finally, Section 8 draws some conclusions and describes our future work. 2. Background 2.1. Conventional matching techniques As we mentioned in the previous section, content-based recommender systems resort to matching techniques to compare the user preferences to the available items. The 1 As the amount of products available in the collaborative recommender system increases, it is less likely that two users rate the same products in their profiles. Since there is no overlapping among their profiles, it is hard to find users with similar tastes. 2 We take reasoning as a synonym for inference. 3 The terms ontology and knowledge base are used without distinction. 306 Y. Blanco-Ferna´ndez et al. / Knowledge-Based Systems 21 (2008) 305–320

Y. Blanco-Ferndndez et al. Knowledge-Based Systems 21(2008)305-320 most widely used techniques nowadays fall into two guages have been proposed in the Semantic Web community,such as RDF(S) [8], DAML [II] OIL [14] DAML t oil [12] and Owl [22]. OwL is currently the (i) Techniques based on cosine similarity: This approach most widely used solution because supports certain fea- represents each available item and the user profile tures missing in other formats, like disjointness and Bool- as two vectors of features/attributes(e. g. in the Tv ean combinations of classes, cardinality restrictions and domain, genre, topic, actors, directors, . ) which specific characteristics of properties (transitive, unique, have a set of possible values( inverse,.) war, traveling. > ) The similarity between both vectors is mats. The focus has been put on RDF and RDFS, since computed as the cosine of the angle between them these were the most extended formats before OWL [24, 4]. This approach only detects similarity when appeared. Despite the vast amount of existing approaches the considered item has exactly the same features (e.g. RQL [20], Squish QL [25], TRIPLE [38] RDQL [36) defined in the user profile, thus preventing any none of these languages support the inference of complex semantic reasoning process semantic relationships like those pursued in our approach (ii) Techniques based on automatic classifiers: Automatic In fact, the query paradigm offered in these formats methods, such as neural networks [71 decisio ning "Get all the instances related to A by means of the relation- [30] Bayesian networks [44], and association rules ings for two main reasons: (i) it forces the user to know in [39, 5, 24]. These are computational models that clas- detail the underlying ontology to specify correct queries, sify a given input in a specific category considering and (ii) it only considers relationships explicitly formalized a predefined training set. In the TV domain, this set in the knowledge base, thus missing others hidden in the past and their semantic attributes), the inputs we pursue have been only considered in the context of the are the features of a given TV program, and the ol research project SemDis. This project proposes an put is a category that determines if this program is approach for querying an RDF ontology and discovering appealing or unappealing for the user. For that pur- complex semantic associations between two instances A pose, the classifiers consider the occurrence patterns and B specified by the user [l]. In order to infer such asso- of the input features in the training set; thus, they ciations, the authors of [1] identify both instances in the only take into account the syntax of these attributes knowledge base and explores(successively) the chains of and disregard their meaning (i.e. their semantics) properties that start from them. Traversing these property sequences, their approach reaches new instances implicitly The syntactic nature of the aforementioned techniques related to those specified by the user. Several semantic cause the overspecialization present in the current content- associations are defined in [l] from these property based recommender systems. Our strategy overcomes that sequences and from the relationships between them limitation by considering the descriptions of the Tv pro- Our recommendation strategy adopts the associations grams and inferring semantic relationships between them. defined in Sem Dis because they favor the discovery of hid- These relationships provide the recommender system with den knowledge about the user preferences, and contribute additional knowledge about the user preferences and thus to more effective matching processes. These associations support more effective personalization processes. This and the relationships between the property sequences that knowledge allows to find TV programs appealing to the user originate them will be detailed in Section 4 Bat do not have the same attributes defined in his/her pro- Although our approach borrows the semantic associa- In order to carry out this semantic reasoning process, for discovering such associations in the RDF(S) ontology our recommender system requires capabilities for querying This is because the Sem Dis query paradigm does not con the classes, properties and instances defined in a knowledge sider the personalization requirements present in a recom ontology, and also for inferring from them the aforemen- mender system. Instead of a paradigm "Get all the semantic oned relationships. In the next section, we review some associations between A and B, our methodology requires a the existing approaches that tackle these issues, and also paradigm " Get the instances related to a and the semantic lentify the limitations that hinder their use in our reason- associations established between them. In AVATAR, A ing methodology refers to the Tv programs the user liked in the past, and the retrieved instances identify the programs finally sug 2.2. Oueries and inferences from onto 4 The query languages defined for OWL are valid for RDF(S)as well. In order to share the knowledge formalized in an onto 5 Additional information about SemDis ogy, a normalized format is required. Several ontology lan- lsdis.cs. uga. edu/Projects/SemDis

most widely used techniques nowadays fall into two categories: (i) Techniques based on cosine similarity: This approach represents each available item and the user profile as two vectors of features/attributes (e.g. in the TV domain, genre, topic, actors, directors,...), which have a set of possible values (, , , ). The similarity between both vectors is computed as the cosine of the angle between them [24,4]. This approach only detects similarity when the considered item has exactly the same features defined in the user profile, thus preventing any semantic reasoning process. (ii) Techniques based on automatic classifiers: Automatic classifiers are based on diverse machine learning methods, such as neural networks [7], decision trees [30], Bayesian networks [44], and association rules [39,5,24]. These are computational models that classify a given input in a specific category considering a predefined training set. In the TV domain, this set stores the user preferences (programs he/she liked in the past and their semantic attributes), the inputs are the features of a given TV program, and the output is a category that determines if this program is appealing or unappealing for the user. For that purpose, the classifiers consider the occurrence patterns of the input features in the training set; thus, they only take into account the syntax of these attributes and disregard their meaning (i.e. their semantics). The syntactic nature of the aforementioned techniques cause the overspecialization present in the current contentbased recommender systems. Our strategy overcomes that limitation by considering the descriptions of the TV programs and inferring semantic relationships between them. These relationships provide the recommender system with additional knowledge about the user preferences and thus support more effective personalization processes. This knowledge allows to find TV programs appealing to the user that do not have the same attributes defined in his/her pro- file, but rather other features semantically related to them. In order to carry out this semantic reasoning process, our recommender system requires capabilities for querying the classes, properties and instances defined in a knowledge ontology, and also for inferring from them the aforementioned relationships. In the next section, we review some of the existing approaches that tackle these issues, and also identify the limitations that hinder their use in our reasoning methodology. 2.2. Queries and inferences from ontologies In order to share the knowledge formalized in an ontology, a normalized format is required. Several ontology languages have been proposed in the Semantic Web community, such as RDF(S) [8], DAML [11], OIL [14], DAML + OIL [12] and OWL [22]. OWL is currently the most widely used solution because supports certain features missing in other formats, like disjointness and Boolean combinations of classes, cardinality restrictions and specific characteristics of properties (transitive, unique, inverse,...). Many authors have proposed different approaches for querying ontologies conforming to these normalized formats. The focus has been put on RDF and RDFS, since these were the most extended formats before OWL appeared.4 Despite the vast amount of existing approaches (e.g. RQL [20], SquishQL [25], TRIPLE [38], RDQL [36]), none of these languages support the inference of complex semantic relationships like those pursued in our approach. In fact, the query paradigm offered in these formats is: ‘‘Get all the instances related to A by means of the relationship R’’. This paradigm is not appropriate for our reasonings for two main reasons: (i) it forces the user to know in detail the underlying ontology to specify correct queries, and (ii) it only considers relationships explicitly formalized in the knowledge base, thus missing others hidden in it. To the best of our knowledge, the kind of relationships we pursue have been only considered in the context of the research project SemDis.5 This project proposes an approach for querying an RDF ontology and discovering complex semantic associations between two instances A and B specified by the user [1]. In order to infer such associations, the authors of [1] identify both instances in the knowledge base and explores (successively) the chains of properties that start from them. Traversing these property sequences, their approach reaches new instances implicitly related to those specified by the user. Several semantic associations are defined in [1] from these property sequences and from the relationships between them. Our recommendation strategy adopts the associations defined in SemDis because they favor the discovery of hidden knowledge about the user preferences, and contribute to more effective matching processes. These associations and the relationships between the property sequences that originate them will be detailed in Section 4. Although our approach borrows the semantic associations, we cannot reuse the mechanism defined in SemDis for discovering such associations in the RDF(S) ontology. This is because the SemDis query paradigm does not consider the personalization requirements present in a recommender system. Instead of a paradigm ‘‘Get all the semantic associations between A and B’’, our methodology requires a paradigm ‘‘Get the instances related to A and the semantic associations established between them’’. In AVATAR, A refers to the TV programs the user liked in the past, and the retrieved instances identify the programs finally sug- 4 The query languages defined for OWL are valid for RDF(S) as well. 5 Additional information about SemDis can be found in http:// lsdis.cs.uga.edu/Projects/SemDis. Y. Blanco-Ferna´ndez et al. / Knowledge-Based Systems 21 (2008) 305–320 307

308 Y. Blanco-Fernandez et al. Knowledge- Based Systems 21(2008)305-320 gested. In order to guarantee the computational viability, The properties are the key elements of our semantic rea we use a controlled inference mechanism that explores soning methodology. Specifically, our approach explores the knowledge base by selecting instances significant to both these properties and the instances(and classes) joined the user and omitting those which are totally irrelevant. by them, with the goal of uncovering meaningful semantic Thanks to this mechanism, our methodology guarantees associations hidden in the knowledge base. To this aim, we that: (i)the discovered associations find programs appeal- adopt the notion of property sequence defined in the Sem- ing to the user, and (ii) the reasoning process adapts as Dis project his/her preferences evolve. This way, our content-based strategy achieves a balance between the diversification 3. 2. Property sequences and the personalization of the offered recommendations Before detailing the reasoning methodology applied in Let C and p be the respective sets of all classes and all AVATAR, the next section describes: (i) the TV ontology properties defined in our ontology. Given a property used in this system and (ii) the profiles that store the view- PE P, its domain( denoted by domain(P)) limits the enti ers'preferences, modeled from the knowledge represented ties of C to which P can be applied, and its range( denoted in the ontology by range(P)) indicates the entities of C that P may take its value 3. The reasoning framework In [1] a property sequence PS is defined as a finite set of 3.1. The Tv ontolog properties [PI,., PN] that join several classes defined in the ontology. This can be formally expressed as follows: Since AVATAR is a TV recommender system, methodology requires an ontology that formalizes the BS={P1,…,PMP∈PW1≤j≤N, cepts and relationships typical in the Tv domain. This range()=domain(Pa+1)VI<i<NI information has been extracted from TV-Anytime [41]. specification that provides detailed semantic descriptions Example 1. For instance, in Fig. I, it is possible to identify about generic audiovisual programs the property sequence PS=HASACTOR, ACTORIN, Our TV ontology has been implemented in OWL. Spe- HASTOPICIjoining the classes Adventure Movies, Starring cifically, it includes a set of classes(representing program Actors, Drama Movies, and War Topics enres, topics, credits, geographical and temporal informa tion,etc. )and properties that establish relationships among An instance of Ps(denoted by ps)is defined as the set of them. Besides, our ontology defines hierarchical relation properties that join specific instances of the classes chips among classes, and among properties. In fact, it con- contained in PS. We use Y to represent that tains several hierarchies defined from the Tv-Anytime y is an instance of Y, being Y a entity(class or property metadata: hierarchies of genres(action movies, nature doc in the ontology. According to this notation, given umentaries,sports,etc.),hierarchies of topics(war, travel, PS=[PI,., PNl, we define its instance ps as follows (countries, cities, etc. ) hierarchies of credits(actors, direc- Ps=(P1,.., Pwl/p-P,VI<J<NS (2) tors, hosts, etc. ), etc L In order to reason about specific TV programs and to Example 2. In Fig. I, it also is possible to identufy an er semantic associations among them, it is necessary to add specific instances of the classes and properties defined In this case, the instance ps =[HasActor, ActorIn, Has Top- in the OwL ontology. Specifically, the Tv programs are ic] links the nodes Cast Away, Tom Hanks, Saving Private represented as instances of the classes defined in the hierar- Rvan, and World War II. chy of genres, and each one is given a unique reference (henceforth, ID). The semantic attributes of these pro- O approac. [l] the following grams(topics, geographical and temporal information, definitions credits, etc )are also defined as instances belonging to the remaining hierarchies of classes mentioned before. These (1) The origin and the terminus of a sequence are the first characteristics are linked to each program by means of and the last nodes contained in it, respectively properties, as shown in the excerpt of the ontology repre (1. 1)As X E C can be the origin of several property sented in Fig. I sequences, we use Ps to identify a sequence originated in X Note that owl Object Property identifies a property between two nodes labeled with a name, whereas rdf type of represents the relationshi between a class and one of its instances. For simplicity, we omitted some 7 For simplicity, we use the term property sequence to refer both to the classes and rdf typeof links in Fig. I(e.g. links between some specific sequences and to their instances. The sequences actors and the class Starring Actors) uppercase letters(i.e. PS), and their instances by lowercase ones (ps)

gested. In order to guarantee the computational viability, we use a controlled inference mechanism that explores the knowledge base by selecting instances significant to the user and omitting those which are totally irrelevant. Thanks to this mechanism, our methodology guarantees that: (i) the discovered associations find programs appealing to the user, and (ii) the reasoning process adapts as his/her preferences evolve. This way, our content-based strategy achieves a balance between the diversification and the personalization of the offered recommendations. Before detailing the reasoning methodology applied in AVATAR, the next section describes: (i) the TV ontology used in this system and (ii) the profiles that store the viewers’ preferences, modeled from the knowledge represented in the ontology. 3. The reasoning framework 3.1. The TV ontology Since AVATAR is a TV recommender system, our methodology requires an ontology that formalizes the concepts and relationships typical in the TV domain. This information has been extracted from TV-Anytime [41], a specification that provides detailed semantic descriptions about generic audiovisual programs. Our TV ontology has been implemented in OWL. Specifically, it includes a set of classes (representing program genres, topics, credits, geographical and temporal information, etc.) and properties that establish relationships among them. Besides, our ontology defines hierarchical relationships among classes, and among properties. In fact, it contains several hierarchies defined from the TV-Anytime metadata: hierarchies of genres (action movies, nature documentaries, sports, etc.), hierarchies of topics (war, travel, disasters, etc.), hierarchies of geographical information (countries, cities, etc.), hierarchies of credits (actors, directors, hosts, etc.), etc. In order to reason about specific TV programs and to infer semantic associations among them, it is necessary to add specific instances of the classes and properties defined in the OWL ontology. Specifically, the TV programs are represented as instances of the classes defined in the hierarchy of genres, and each one is given a unique reference (henceforth, ID). The semantic attributes of these programs (topics, geographical and temporal information, credits, etc.) are also defined as instances belonging to the remaining hierarchies of classes mentioned before. These characteristics are linked to each program by means of properties, as shown in the excerpt of the ontology represented in Fig. 1. 6 The properties are the key elements of our semantic reasoning methodology. Specifically, our approach explores both these properties and the instances (and classes) joined by them, with the goal of uncovering meaningful semantic associations hidden in the knowledge base. To this aim, we adopt the notion of property sequence defined in the SemDis project. 3.2. Property sequences Let C and P be the respective sets of all classes and all properties defined in our ontology. Given a property P 2 P, its domain (denoted by domain (P)) limits the entities of C to which P can be applied, and its range (denoted by range (P)) indicates the entities of C that P may take as its value. • In [1], a property sequence PS is defined as a finite set of properties [P1,...,PN] that join several classes defined in the ontology. This can be formally expressed as follows: PS ¼ f½P1; ... ; P N =Pj 2 P 8 1 6 j 6 N; range ðPiÞ ¼ domainðPiþ1Þ 8 1 6 i < Ng ð1Þ Example 1. For instance, in Fig. 1, it is possible to identify the property sequence PS = [HASACTOR, ACTORIN, HASTOPIC] joining the classes Adventure Movies, Starring Actors, Drama Movies, and War Topics. • An instance of PS (denoted by ps) is defined as the set of properties that join specific instances of the classes contained in PS. 7 We use y ! rdf :typeOf Y to represent that y is an instance of Y, being Y a entity (class or property) in the ontology. According to this notation, given PS = [P1,...,PN], we define its instance ps as follows: ps ¼ f½p1; ... ; pN =pj ! rdf :typeOf Pj 8 1 6 j 6 Ng ð2Þ Example 2. In Fig. 1, it also is possible to identify an instance of the property sequence PS used in Example 1. In this case, the instance ps = [HasActor, ActorIn, HasTopic] links the nodes Cast Away, Tom Hanks, Saving Private Ryan, and World War II. Our approach also borrows from [1] the following definitions: (1) The origin and the terminus of a sequence are the first and the last nodes contained in it, respectively. (1.1) As X 2 C can be the origin of several property sequences, we use PSX to identify a sequence originated in X: 6 Note that owl:ObjectProperty identifies a property between two nodes labeled with a name, whereas rdf:typeOf represents the relationship between a class and one of its instances. For simplicity, we omitted some classes and rdf:typeof links in Fig. 1 (e.g. links between some specific actors and the class Starring Actors). 7 For simplicity, we use the term property sequence to refer both to the sequences and to their instances. The sequences are represented by uppercase letters (i.e. PS), and their instances by lowercase ones (ps). 308 Y. Blanco-Ferna´ndez et al. / Knowledge-Based Systems 21 (2008) 305–320

PSX ¼ f½P1;...;P M =Pt 2P 8 1 6 t 6 M; X 2C; domainðP1Þ ¼ X; rangeðPiÞ ¼ domainðPiþ1Þ 8 1 6 i < Mg (1.2) Analogously, being x an instance of class X 2 C, we use psx to identify an instance of the sequence PSX originated in x. This way, given PSX = [P1,...,PM], psx is defined as follows: psx ¼ f½p1; ... ; pM =x ! rdf :typeOf X; pi ! rdf :typeOf Pi; 8 1 6 i 6 Mg (2) The length of ps is the number of properties contained in the sequence. In our approach, the sequence with the minimum length between two given instances is named geodesic sequence. (3) Like in SemDis, we define two functions to access the classes Cj of the property sequence PSX (function PSX.Nodes Of PS ()), and the nodes cj of one of its instances psx (function psx .PSNodesSequence()). On one hand, given PSX = [P1,...,PM], we define the first function as: PSX :NodesOfPSðÞ ¼ f½C1;...;CMþ1=Ct 2 C 8 1 6 t 6 M þ1; C1 ¼ X; Cj ¼ domainðPjÞ; 8 1 6 j 6 M; CNþ1 ¼ range Pð Þg ð M 3Þ On the other hand, given psx = [p1,...,pM] (an instance of the sequence PSX = [P1,...,PM]) and given that PSX.NodesOfPS() = [C1,...,CM+1], we define the function psx .PSNodesSequence() as follows: psx :PSNodesSequenceðÞ ¼ f½c1; ... ; cMþ1=c1 ¼ x; cj ! rdf :typeOf Cj; 8 1 6 j 6 M þ 1g ð4Þ Example 3. Let us consider in Fig. 1 two8 property sequences originated in x ¼ CastAway : psx 1 ¼ ½HasDirector (length 1), and psx 2 ¼ ½HasActor; ActorIn; HasTopic (length 3). The nodes included in both sequences are psx 1. PSNodesSequence = {Cast Away, Rober Zemeckis} and psx 2:PSNodes Sequence ¼ fCastAway; TomHanks; SavingPrivateRyan; WorldWarIIg: Finnish Tours Travelling Adv. Discover Spain! HasAgency HasTopic HasTopic HasTopic rdf:type Of owl:Object Property Camaron Flamenco dancing Carmen Flamenco Women Oscar Jaenada Spain Belle Epoque Fernando Trueba Penelope Cruz Jorge Sanz Spanish Civil War World War II Saving Private Ryan Tom Hanks Robert Zemeckis Cast Away TopicIn TopicIn HasPlace PlaceIn HasPlace PlaceIn HasActor HASACTOR ACTORIN HASTOPIC HasActor HasActress ActorIn ActorIn HasDirector HasDirector Tourism Contents Romance Movies Adventure Movies War Topics Directors Classes Instances Starring Actors Drama Movies Fig. 1. Subset of instances, classes and properties in our OWL ontology. 8 In the following, we will use a subindex to identify each sequence ps originated in a given node. Y. Blanco-Ferna´ndez et al. / Knowledge-Based Systems 21 (2008) 305–320 309

310 Y. Blanco-Fernandez et al. Knowledge- Based Systems 21(2008)305-320 3.3. The user profiles where Cm is the superclass of Cm+I and #sib(Cm+l)de- notes the number of siblings of the class Cm+i in the Our proposal models the users' preferences by reusing hierarchy of genres defined in the TV ontology the knowledge formalized in the OWl ontology(hence our profiles are called ontology -profiles ) In contrast with As a result of Eq(5), this approach leads to higher other works that combine user modeling and use of ontol- DOI indexes for the superclasses closer to the leaf class ogies[23], our approach requires a more detailed semantic whose value is being propagated, and lower ones for model of the user preferences. This model stores the pro- the classes which are closer to the root of the hierarchy grams interesting and unappealing to the user(named posi- In addition, the higher the dOi of a given class and the tive and negative preferences, respectively ), along with their lower the number of its siblings, the higher the Doi semantic descriptions. Such descriptions refer both to their index propagated to its superclass. As a class can b main semantic characteristics and to the hierarchical genres superclass to multiple classes, every time its DoI is these programs belong to(e. g. action movies, sport, music, updated by eq. (5), we average the indexes of all of its nature documentaries, etc. ) Since this information is subclasses defined in the profile, and so compute its final already defined in the TV ontology, our approach does DoI index. not store it again in each user profile. Really, a user profile Our ontology-based model enables intelligent reasoning contains the IDs of the programs which are(un ) appealing about the semantics of the user preferences, and allows to to him/her. From their IDs, the recommender system can discover- by inferring semantic associations -enhanced locate these programs in the OWL ontology and access recommendations that would go unnoticed in the existing their complete semantic descriptions. This way, AVATAR systems. This reasoning process favors a more effective knows the attributes and genres of each program pointed matching process between the viewer profile and the pro- to from the user profile grams available in AVATAR, beyond the syntactic match- Along with the programs IDs, the profiles store the lev- ing techniques described in Section 2.1 els of interest for each instance and for each class referred Example 4. Let U be a Finnish viewer who has viewed the to the user preferences. The degrees of in terest(named following programs: (i)the movie Camaron about a famous DOI indexes)belong to the range [-1, 1] and can be spec- flamenco artist, (i) an advertisement about the travel ified by the user or inferred automatically by the recom- mender system. Specifically, the value of the Doi index agency Finnish Tours, (i) the movie Cast Away whose corresponding to a program recommended by avatar starring actor is Tom Hanks, and(iv) the movie Saving Prirate Ryan that also involves this actor and it is about depends on several factors, such as the user's reply to the World War II(henceforth wwIn). Taking into account the suggestion(accept or reject), the percentage of the program knowledge represented in Fig. I, our approach builds Us viewed by the user, and the time elapsed until the user deci- des to view the recommended program [6] profile from the instances referred to the aforementioned The Doi of each program is also used to set the doi programs their main semantic attributes and the hierarchy indexes of: (i) each one of its semantic characteristics of genres they belong to, as shown in Fig. 2. The DOI indexes (in brackets in Fig. 2)stored in this profile show (actors, presenters, topics, etc. ) and (i)the genres under that U liked all of the programs, except the war movie which this program is classified in the ontology Saving Private ryan The semantic characteristics of a given program inherit the DOI index of this program. Besides, if a characteris- 4. Semantic associations tic is related to several programs, we calculate its DOI index by averaging their respective levels of interest. In order to find programs semantically related to the Regarding the computation of the doi indexes for each viewer preferences, our methodology locates in the OWL ass(genre)included in the profile, we firstly compute ontology the programs he/she liked in the past,and the DOI of each leaf class as the average value of the explores the property sequences starting from them. As DOI indexes assigned to the programs in the profile these sequences are traversed and new instances are belonging to that class. Then, we propagate these values reached, we predict their relevance for the user. Once the through the genre hierarchy until reaching its root class. irrelevant instances have been filtered out, we get a set of For that purpose, we adopt the approach proposed in property sequences from which the semantic associations [43]. that leads to Eq. (5) (between specific TV programs) must be inferred, and the final recommendations are elab DOI(Cm) DOI(Cm+1) +#sib(Cm+l) For clarity, Fig. 2 shows the classes and instances which identify Us 9 These instances identify specific TV programs and their semantic preferences actually, Pu only contains the IDs of the four programs and attributes, whereas the classes refer to the genres under which the the DoI indexes for each one of these classes and instances(which programs are classified in the ontology stored in the OWl ontology and are pointed from the profile)

3.3. The user profiles Our proposal models the users’ preferences by reusing the knowledge formalized in the OWL ontology (hence, our profiles are called ontology-profiles). In contrast with other works that combine user modeling and use of ontologies [23], our approach requires a more detailed semantic model of the user preferences. This model stores the programs interesting and unappealing to the user (named positive and negative preferences, respectively), along with their semantic descriptions. Such descriptions refer both to their main semantic characteristics and to the hierarchical genres these programs belong to (e.g. action movies, sport, music, nature documentaries, etc.). Since this information is already defined in the TV ontology, our approach does not store it again in each user profile. Really, a user profile contains the IDs of the programs which are (un)appealing to him/her. From their IDs, the recommender system can locate these programs in the OWL ontology and access their complete semantic descriptions. This way, AVATAR knows the attributes and genres of each program pointed to from the user profile. Along with the programs IDs, the profiles store the levels of interest for each instance and for each class referred to the user preferences.9 The degrees of interest (named DOI indexes) belong to the range [1, 1] and can be specified by the user or inferred automatically by the recommender system. Specifically, the value of the DOI index corresponding to a program recommended by AVATAR depends on several factors, such as the user’s reply to the suggestion (accept or reject), the percentage of the program viewed by the user, and the time elapsed until the user decides to view the recommended program [6]. The DOI of each program is also used to set the DOI indexes of: (i) each one of its semantic characteristics (actors, presenters, topics, etc.), and (ii) the genres under which this program is classified in the ontology. • The semantic characteristics of a given program inherit the DOI index of this program. Besides, if a characteristic is related to several programs, we calculate its DOI index by averaging their respective levels of interest. • Regarding the computation of the DOI indexes for each class (genre) included in the profile, we firstly compute the DOI of each leaf class as the average value of the DOI indexes assigned to the programs in the profile belonging to that class. Then, we propagate these values through the genre hierarchy until reaching its root class. For that purpose, we adopt the approach proposed in [43], that leads to Eq. (5): DOIðCmÞ ¼ DOIðCmþ1Þ 1 þ #sibðCmþ1Þ ð5Þ where Cm is the superclass of Cm+1 and #sib(Cm+1) denotes the number of siblings of the class Cm+1 in the hierarchy of genres defined in the TV ontology. As a result of Eq. (5), this approach leads to higher DOI indexes for the superclasses closer to the leaf class whose value is being propagated, and lower ones for the classes which are closer to the root of the hierarchy. In addition, the higher the DOI of a given class and the lower the number of its siblings, the higher the DOI index propagated to its superclass. As a class can be superclass to multiple classes, every time its DOI is updated by Eq. (5), we average the indexes of all of its subclasses defined in the profile, and so compute its final DOI index. Our ontology-based model enables intelligent reasoning about the semantics of the user preferences, and allows to discover – by inferring semantic associations – enhanced recommendations that would go unnoticed in the existing systems. This reasoning process favors a more effective matching process between the viewer profile and the programs available in AVATAR, beyond the syntactic matching techniques described in Section 2.1. Example 4. Let U be a Finnish viewer who has viewed the following programs: (i) the movie Camaron about a famous flamenco artist, (ii) an advertisement about the travel agency Finnish Tours, (iii) the movie Cast Away whose starring actor is Tom Hanks, and (iv) the movie Saving Private Ryan that also involves this actor and it is about World War II (henceforth WWII). Taking into account the knowledge represented in Fig. 1, our approach builds U’s profile from the instances referred to the aforementioned programs, their main semantic attributes and the hierarchy of genres they belong to, as shown in Fig. 2. 10 The DOI indexes (in brackets in Fig. 2) stored in this profile show that U liked all of the programs, except the war movie Saving Private Ryan. 4. Semantic associations In order to find programs semantically related to the viewer preferences, our methodology locates in the OWL ontology the programs he/she liked in the past, and explores the property sequences starting from them. As these sequences are traversed and new instances are reached, we predict their relevance for the user. Once the irrelevant instances have been filtered out, we get a set of property sequences from which the semantic associations (between specific TV programs) must be inferred, and the final recommendations are elaborated. 9 These instances identify specific TV programs and their semantic attributes, whereas the classes refer to the genres under which the programs are classified in the ontology. 10 For clarity, Fig. 2 shows the classes and instances which identify U’s preferences; actually, PU only contains the IDs of the four programs and the DOI indexes for each one of these classes and instances (which are stored in the OWL ontology and are pointed from the profile). 310 Y. Blanco-Ferna´ndez et al. / Knowledge-Based Systems 21 (2008) 305–320

Y. Blanco-Fernandez et al. Knowledge-Based Systems 21(2008)305-320 rdfs Sub Class Of → rdf: type Of [1]/Oscar → owl: Object Property ○sans Movies (0.9) 阿 War‖(1) (0.3) Contents Movies - Zemeckis Contents Travelling (1) [1]hasActor [2 hasDirector [3] has Topic Fig. 2. The user ontology profle PL n order to identify the semantic associations between (ii)Association p-join: Let PS and PS, be two joined two instances as described in SemDis, it is necessary to property sequences with a join node C(i.e. PSID explore the property sequences in which both of them are PS2), and let psI and ps2 be two instances of included, and the relationships between these sequences PS and PS,, respectively. Then, p-join Associat Let x and y be two instances in our OWl knowledge base ( x, y) is true if x is the origin of psi and y is the referred to two specific Tv programs. In addition, let origin of ps2, or x is terminus of psi and y terminus PSI=[Po,., Pn] and PS2=[@o,., 0m] be two property sequences, instantiated by psI and ps2, respectively. Or Example 6. From the property sequences Ps Saving approach adopts the following relationship between two Private Ryan-wwII and ps2: Belle Epoque-Spanish property sequences Joined property sequences: PS and PS2 are joined Civil War, it is possible to infer the semantic associ (denoted by PSDe PS2) if they contain at least one com ation p-join Associated(Saving Private Ryan, Belle Epo- mon class named join node Using the function formalized que) by means of the join class War Topics. Thi in expression 3, PS o PS, holds if PS. Nodes- association detected because both movies are settled OfPS()n PS]. Nodes OfPS()=ICI is a nonempty set where C is the join node. From the above relation, the authors of [l]define several 5. Filtering and inference methodology semantic associations As we mentioned in previous sections, our approach thAssociated(x, y) is true if must only uncover associations relevant for the user,i.e there exists a property sequence ps where x is the ori- ssociations that relate in a significant way his/her positive gin and y is the terminus, or vice versa preferences to the finally suggested TV programs. The ranking process of semantic associations has also been Example 5. From the sequence Camaron-Flamenco addressed in the SemDis project. There, the authors pro- dancing-Carmen in Fig. 1, p-pathAssociated(Camaron, posed a mechanism for uncovering all the associations Carmen)is true because both movies are about the same between two user-specified instances, and for ranking them topic according to their relevance for a query [18]. The applica

In order to identify the semantic associations between two instances as described in SemDis, it is necessary to explore the property sequences in which both of them are included, and the relationships between these sequences. Let x and y be two instances in our OWL knowledge base referred to two specific TV programs. In addition, let PS1 = [P0,...,Pm] and PS2 = [Q0,...,Qm] be two property sequences, instantiated by ps1 and ps2, respectively. Our approach adopts the following relationship between two property sequences. Joined property sequences: PS1 and PS2 are joined (denoted by PS1fflqPS2) if they contain at least one common class named join node. Using the function formalized in expression 3, PS1fflqPS2 holds if PS1. NodesOfPS() \ PS2.NodesOfPS() = {C} is a nonempty set, where C is the join node. From the above relation, the authors of [1] define several semantic associations. (i) Association q-path: q-pathAssociated(x,y) is true if there exists a property sequence ps where x is the origin and y is the terminus, or vice versa. Example 5. From the sequence Camaron–Flamenco dancing–Carmen in Fig. 1, q-pathAssociated (Camaron, Carmen) is true because both movies are about the same topic. (ii) Association q-join: Let PS1 and PS2 be two joined property sequences with a join node C (i.e. PS1fflq PS2), and let ps1 and ps2 be two instances of PS1 and PS2, respectively. Then, q-joinAssociated(x,y) is true if x is the origin of ps1 and y is the origin of ps2, or x is terminus of ps1 and y terminus of ps2. Example 6. From the property sequences ps1: Saving Private Ryan–WWII and ps2: Belle Epoque–Spanish Civil War, it is possible to infer the semantic association q-joinAssociated (Saving Private Ryan, Belle Epoque) by means of the join class War Topics. This association is detected because both movies are settled in wars. 5. Filtering and inference methodology As we mentioned in previous sections, our approach must only uncover associations relevant for the user, i.e. associations that relate in a significant way his/her positive preferences to the finally suggested TV programs. The ranking process of semantic associations has also been addressed in the SemDis project. There, the authors proposed a mechanism for uncovering all the associations between two user-specified instances, and for ranking them according to their relevance for a query [18]. The applicardfs:Sub Class Of owl:Object Property rdf:type Of [1] hasActor [2] hasDirector [3] hasTopic Oscar Jaenada (0.9) Tom Hanks (0) World War II -1( ) Travelling Adv. 1( ) Cast Away (1) Camaron (0.9) Saving Private Ryan -1( ) [3] [3] [1] [1] [1] [2] Flamenco dancing (0.9) Robert Zemeckis (1) Classes Instances Romance Movies (0.9) Drama Movies (0.3) Adventure Movies (1) Tourism Contents ( ) 1 TV Contents Movies (0.8) Fig. 2. The user ontology-profile PU. Y. Blanco-Ferna´ndez et al. / Knowledge-Based Systems 21 (2008) 305–320 311

Y. Blanco-Fernandez et al. Knowledge- Based Systems 21(2008)305-320 tion of this approach in AVATAR would imply discover- referred to Us positive preferences (i.e. TV programs with ing all of the possible associations between each program positive DOI indexes in his/her profile). This process is the user liked and each program defined in the OWL ontol- based on the assumption that the semantic relationship ogy. After applying the SemDis ranking process, AVA between a node of the sequence and the user preferences TAR would select(and recommend )a set of programs decreases along the sequence length. Consequently, the semantically related to the viewer's preferences goal is to find out which instance is the terminus of each As this approach would be too demanding in compu- considered sequence, that is, the last node with a significant tational terms, we propose a new inference methodology relevance for U. In other words, our approach must extract that: (i)locates the user preferences in the OWl knowl- a set of property sequences whose instances are relevant for edge base, (ii) explores successively the property U, and infer from them semantic associations between spe sequences starting from these programs, and (ii) filters cific TV programs. out the irrelevant instances contained in the analyzed The proposed inference methodology requires several sequences parameters notation (i )Firstly, our approach explores each instance Uf Let Pu be the profile of the user U, whose preferences included in the analyzed sequence, and computes a are N, instances that identify specific TV programs new parameter, named semantic intensity and denoted and their semantic attributes. From among the NI by isem(Up ). Intuitively, the e semantic intensity of instances we consider that. instance is a measure of the strength of the relation- p instances refer to progra ship between the instance and the user preferences sa identify semantic attributes (i.e. the greater its value, the stronger the relation The computation of this parameter will be detailed in Section 5.2 Each one of the programs identified in Pu(denoted by x i with I i< Np)is the origin of Ri property sequences (i)Next, the semantic intensity of u is compared to a the OWl knowledge base (denoted by psk with semantic threshold (denoted by isemy for the user U 1≤k≤R) and belonging to [0, 1]. If 7sem(Uk) is greater than As shown in Fig 3, each one of these sequences pspcon 7Semy, the specified node is included in the sequence, tains N.nodes(remember expression 4), which are and the process is repeated for the following node in the node xi and the sequence. Otherwise, this node is filtered out and he nodes denoted by t(with2≤j≤M) in the the precedent node is chosen as the sequence terminus. following expression The process is repeated for each program defined in psX PSNodesSequence(=kx=u:, UK, .. U k1(6) is that program in the OWL knowledge base. Regarding 'Semy, note that AVATAR uses a semantic threshold for each user, whose value (initially equal to 0.5 for all of 5. 1. Filtering of property sequences hem)depends on the accuracy of the past recommenda tions. As the precision of the offered suggestion Our methodology analyzes the property se quences defined in the Tv ontolo hose origin are instances instances very strongly related to Us preferences 5.2. Computing To compute the semantic intensity of a specific instance included in a sequence, it is necessary to use some param- eters which allow to detect simple relationships between this instance and the user preferences. We explain these parameters in Section 5. 1. After that, our semantic infer ence methodology will be illustrated by computing the q q intensity of the node ug contained in the sequence psk shown in Fig. 4, whose origin is one of the programs xi entified(by its ID)in the profile Given three instances a, b and c, and their respective clas- ses Ca, Cb and Ce belonging to a hierarchy H, we define the ig. 3. Ri property sequences originated in program xp. tollowing concep

tion of this approach in AVATAR would imply discovering all of the possible associations between each program the user liked and each program defined in the OWL ontology. After applying the SemDis ranking process, AVATAR would select (and recommend) a set of programs semantically related to the viewer’s preferences. As this approach would be too demanding in computational terms, we propose a new inference methodology that: (i) locates the user preferences in the OWL knowledge base, (ii) explores successively the property sequences starting from these programs, and (iii) filters out the irrelevant instances contained in the analyzed sequences. Notation: • Let PU be the profile of the user U, whose preferences are NI instances that identify specific TV programs and their semantic attributes. From among the NI instances, we consider that: – NP instances refer to programs, and – NSA identify semantic attributes. • Each one of the programs identified in PU (denoted by xi with 1 6 i 6 NP) is the origin of Ri property sequences in the OWL knowledge base (denoted by psxi k with 1 6 k 6 Ri). • As shown in Fig. 3, each one of these sequences psxi k contains Nk,i nodes (remember expression 4), which are: – the node xi, and – the nodes denoted by v i;j k (with 2 6 j 6 Nk,i ) in the following expression: psxi k :PSNodesSequenceðÞ ¼ ½xi ¼ v i;1 k ; v i;2 k ; ... ; v i;Nk;i k ð6Þ 5.1. Filtering of property sequences Our methodology analyzes the property sequences defined in the TV ontology whose origin are instances referred to U’s positive preferences (i.e. TV programs with positive DOI indexes in his/her profile). This process is based on the assumption that the semantic relationship between a node of the sequence and the user preferences decreases along the sequence length. Consequently, the goal is to find out which instance is the terminus of each considered sequence, that is, the last node with a significant relevance for U. In other words, our approach must extract a set of property sequences whose instances are relevant for U, and infer from them semantic associations between specific TV programs. The proposed inference methodology requires several parameters: (i) Firstly, our approach explores each instance v i;j k included in the analyzed sequence, and computes a new parameter, named semantic intensity and denoted by kSemðv i;j k Þ. Intuitively, the semantic intensity of an instance is a measure of the strength of the relationship between the instance and the user preferences (i.e. the greater its value, the stronger the relation). The computation of this parameter will be detailed in Section 5.2. (ii) Next, the semantic intensity of v i;j k is compared to a semantic threshold (denoted by kSemU for the user U and belonging to [0,1]). If cSemðv i;j k Þ is greater than cSemU , the specified node is included in the sequence, and the process is repeated for the following node in the sequence. Otherwise, this node is filtered out and the precedent node is chosen as the sequence terminus. The process is repeated for each program defined in U’s profile, and for each property sequence whose origin is that program in the OWL knowledge base. Regarding cSemU , note that AVATAR uses a semantic threshold for each user, whose value (initially equal to 0.5 for all of them) depends on the accuracy of the past recommendations. As the precision of the offered suggestions decreases, cSemU is increased in order to infer only instances very strongly related to U’s preferences. 5.2. Computing the semantic intensity To compute the semantic intensity of a specific instance included in a sequence, it is necessary to use some parameters which allow to detect simple relationships between this instance and the user preferences. We explain these parameters in Section 5.2.1. After that, our semantic inference methodology will be illustrated by computing the intensity of the node v i;l k contained in the sequence psxi k shown in Fig. 4, whose origin is one of the programs xi identified (by its ID) in the profile PU. 5.2.1. Basic concepts Given three instances a,b and c, and their respective classes Ca, Cb and Cc belonging to a hierarchy H, we define the following concepts: xi p1 p2 p3 v i, v1 i, 2 p v1 i, 3 ps1 xi q1 q2 q3 ps xi Ri i, 2 vRi i, 3 vRi 1 N i, vRi N Ri ,i q N -1 Ri ,i N -1 1,i 1,i Fig. 3. Ri property sequences originated in program xi. 312 Y. Blanco-Ferna´ndez et al. / Knowledge-Based Systems 21 (2008) 305–320

Y. Blanco-Fernandez et al. Knowledge-Based Systems 21(2008)305-320 e(V)? Fig 4. Property sequence instance psk defined in our OWL ontology 5.2.1.I. Hierarchical similarity. The value of the hierarchi- Example 7. In Fig. 1, bFamencod Camaron, Spain) cal similarity between a and b(denoted by SimHie(a, b)) because Flamenco dancing is included in two out of the depends on the position of their classes Ca and Cb in the three geodesic property sequences(of length 3)existing hierarchy H. In the literature, it is possible to find numer- between Camaron and Spain( the instance Oscar Jaenada is ous taxonomy-based similarity metrics [34, 10, 33, 42]. As contained in the third shortest path between both nodes Ganesan explained in [15] these approaches do not accu- thus oScar , Camaron, Spain)=D) rately capture similarity in certain domains, such as when Finally, note that betweenness is also a factor for the data is sparse or when there are known relationships measuring the strength of the relationship between the a hierarchical domain structure in order to produce more higher b(a, b). This fact will be used in the inference intuitive similarity scores. These scores are based on signif- methodology described in the next section cant relationships between the considered concepts detected by means of the concepts depth and Lowest Com- mon Ancestor (LCA) 5.2.2. Components of the semantic intensity The computation of the semantic intensity for the node The depth of a is the number of hierarchical links tf, shown in Fig. 4- whose value grows along with the between the hierarchy root and the class Ca in H. Con- strength of the relationship between uk and the user prefer sequently, the depth of the root class is 0. ences-involves four components LCAab is defined as the class of greatest depth in H that is an ancestor of both Ca and Ch 5.2.2.1. CH. Hierarchical similarity between uk and Pu. Component CH detects hierarchical relationships between nese concepts, SemHiela, b)can be defined instance"and the user preferences defined in the profile follows Pu. Its value depends on:(i) the hierarchical similarity depth(LCAa.b) between uk and the g instances(pointed from Pu)belor SimHie(a, b) max(depth(a), depth(b)) (7) ing to the same class hierarchy of uk, and(i) the degree of interest of the user for these instances Firstly, our approach selects the instances x; which are only LCAab(because its depth is O). Besides, the deeper most(hierarchically) similar to uk. These instances, whose Ca and Ch, and the closer LCAah and both classes are in H, value of similarity with uk is M, are included in the set H, the greater SimHre a, b), because in this case both instances as shown in Eq.(9) are very specific and are closely related M=max{ SimMel(",x)}V,1≤j≤Q 5.2.1.2. Betweenness. In graph theory, betweenness is used H=x/SimHie(uk, x))=MI to measure the centrality of each node in the graph. It is Next, CHuK, Pu) is computed as the average of the DOI defined as the number of times that it is necessary to cross indexes of these instances, weighted with M, as shown in a node to join any pair of nodes through the shortest path Eq (10) in the graph(named geodesic path In our proposal, the betweenness of instance c with CH(U, Pu) DOlP(x) with x∈H(10) espect to a and b(denoted by b(a, b)) is the ratio of the geodesic property sequences whose origin is a, whose termi nus is b, and which also include c(see Eq. ( 8)) According to Eq. (10), the greater the number of hierarchi cal relationships between uk and the instances x; which are be(a, b) (8) most appealing to U, the higher the value of CH(UF, PU) Our approach compares programs vs programs in the genre hierarchy, topics vs. topics in the topic hierarchy, etc

5.2.1.1. Hierarchical similarity. The value of the hierarchical similarity between a and b (denoted by SimHie(a,b)) depends on the position of their classes Ca and Cb in the hierarchy H. In the literature, it is possible to find numerous taxonomy-based similarity metrics [34,10,33,42]. As Ganesan explained in [15], these approaches do not accurately capture similarity in certain domains, such as when the data is sparse or when there are known relationships between the compared items. For that reason, we adopt Ganesan et al.’s approach and use a measure that exploits a hierarchical domain structure in order to produce more intuitive similarity scores. These scores are based on significant relationships between the considered concepts, detected by means of the concepts depth and Lowest Common Ancestor (LCA). • The depth of a is the number of hierarchical links between the hierarchy root and the class Ca in H. Consequently, the depth of the root class is 0. • LCAa,b is defined as the class of greatest depth in H that is an ancestor of both Ca and Cb. Based on these concepts, SemHie(a,b) can be defined as follows: SimHieða; bÞ ¼ depthðLCAa;bÞ maxðdepthðaÞ; depthðbÞÞ ð7Þ As shown in Eq. (7), SimHie(a,b) is 0 if the root class in H is the only LCAa,b (because its depth is 0). Besides, the deeper Ca and Cb, and the closer LCAa,b and both classes are in H, the greater SimHie(a,b), because in this case both instances are very specific and are closely related. 5.2.1.2. Betweenness. In graph theory, betweenness is used to measure the centrality of each node in the graph. It is defined as the number of times that it is necessary to cross a node to join any pair of nodes through the shortest path in the graph (named geodesic path). In our proposal, the betweenness of instance c with respect to a and b (denoted by bc(a,b)) is the ratio of the geodesic property sequences whose origin is a, whose terminus is b, and which also include c (see Eq. (8)). bcða; bÞ ¼ #geoða; c; bÞ #geoða; bÞ ð8Þ Example 7. In Fig. 1, bFlamenco d:ðCamaron; SpainÞ ¼ 2 3, because Flamenco dancing is included in two out of the three geodesic property sequences (of length 3) existing between Camaron and Spain (the instance Oscar Jaenada is contained in the third shortest path between both nodes, thus bOscar J.(Camaron, Spain) = 1 3). Finally, note that betweenness is also a factor for measuring the strength of the relationship between the instances involved in its computation. So, according to Eq. (8), the stronger the relationship between a, b and c, the higher bc(a,b). This fact will be used in the inference methodology described in the next section. 5.2.2. Components of the semantic intensity The computation of the semantic intensity for the node v i;l k shown in Fig. 4 – whose value grows along with the strength of the relationship between v i;l k and the user preferences – involves four components. 5.2.2.1. CH: Hierarchical similarity between v i;l k and PU. Component CH detects hierarchical relationships between instancei,j and the user preferences defined in the profile PU. Its value depends on: (i) the hierarchical similarity between v i;l k and the Q instances (pointed from PU) belonging to the same class hierarchy11 of v i;l k , and (ii) the degree of interest of the user for these instances. Firstly, our approach selects the instances xj which are most (hierarchically) similar to v i;l k . These instances, whose value of similarity with v i;l k is M, are included in the set H, as shown in Eq. (9). M ¼ maxfSimHieðv i;l k ; xjÞg 8 j; 1 6 j 6 Q H ¼ fxj=SimHieðv i;l k ; xjÞ ¼ Mg ð9Þ Next, CH ðv i;l k ; P U Þ is computed as the average of the DOI indexes of these instances, weighted with M, as shown in Eq. (10). CH ðv i;l k ; P U Þ ¼ M j H j X jHj j¼1 DOI P U ðxjÞ with xj 2 H ð10Þ According to Eq. (10), the greater the number of hierarchical relationships between v i;l k and the instances xj which are most appealing to U, the higher the value of CH ðv i;l k ; P U Þ. xi p1 p2 pl-2 pl-1 pl Sem pl 3 pN k,i i, l vk vk i, l-2 vk i, l-1 vk i, 2 vk i, N k,i vk i,l Fig. 4. Property sequence instance psxi k defined in our OWL ontology. 11 Our approach compares programs vs. programs in the genre hierarchy, topics vs. topics in the topic hierarchy, etc. Y. Blanco-Ferna´ndez et al. / Knowledge-Based Systems 21 (2008) 305–320 313

Y. Blanco-Fernandez et al. Knowledge- Based Systems 21(2008)305-320 On the contrary, this component is low when uI is (hierar- CL(Uk, Pu) chically) related to U's negative preferences Example 8. Let us focus on the user profile Pu shown in where pssk=lPi,.,Pi_l and pss*. PSNodes Sequence Fig. 2, and on the instance Spanish Civil War(belonging =ki,Uk,. Uk in Fig 4 to the War Topics class)in Fig. 1. To decide whether this instance is appealing to U, the component CH(spanish Example 10. For the sequence introduced in Example Civil War, Pu) considers the similarity between Spanish we get Cz(Carmen, Pu)=2>4=CL(Discover Spainl, Pu) Civil War and those instances in the profile Pu belonging Both values allow to detect that the relationship between to its same hierarchy (in this case, the hierarchy of Carmen and Camaron is more significant than the relation- topics). In this example, as U is not interested in the ship between the latter movie and the documentary Discover Spain! In other words, the relationship"movie instance wwll belonging to the hierarchy or p about the flamenco dancing"is more direct than"movie (because DOIPu(wwID=-1), AVATAR infers that the Spanish Civil War is not a topic appealing to him/her. about the country where flamenco dancing arose To uncover this knowledge, the system uses the value CH(Spanish Civil War, Pu)=-l obtained from Eq(10) In this case, H=wWII and the hierarchical similarity 5.2.2.4. Cc Semantic centrality of uk. At the end of Section between wWII and Spanish Civil War is I because 5, we said that the stronger the relationship between the both instances belong to the same class(War Topics in user preferences and uk, the greater the value of isem(up) Fig. 1) To quantify the strength of this relationship, we resort to the notion of semantic centrality, modeled by component Cc(Uk, Pu). We use the concept of betweenness, because 5.2.2.2. Cr User interest in x origin of the sequence psk this is well-suited for measuring significant relationships analyzed. If uF is closely related to an instance the user between the instances involved in its computation. 13Tak- likes, iSem(uk) will be high. This degree of interest is ing into account that the relationships on which we focus eflected in our proposal by the component C,(u:, Pu). are established between the instances identified in the pro- So, the greater the DOI of instance x;(the origin of the file Pu and u/ the computed betweenness must involve spe- analyzed sequence), the higher CI(Uk, Pu cifically these instances. Consequently, Cc(Uk, Pu) CU, Pu)=DOIPu (i ()computed by adding the betweenness of each instance in Pu, computed with respect to uK and the remaining NI-I instances pointed to from this profile, as shown g(13). High be Example 9. Let us recall the profile Pu shown in Fig. 2 some instances identified in the user profile to u it is nec- and the sequence Camaron-Flamenco dancing-Carmen- essary to cross other instances also contained in Us prefer Spain-Discover Spain! represented in Fig. 1. In ences. This way, it is possible to identify significant scenario, we compute component C, for two different nodes. relationships between uf and instances in the profile P So, C,(Carmen, Pu)=C,(DiscoverSpain!, Pu)= DOl pu Camaron)=0.9.However, the relationships between the Cc(u!, Pu)X,2 bx,(g, U) DOIPu(,)+DOI Pu(g) movie Camaron (viewed by the user) and the two men- tioned programs are not equally significant or direct: to discriminate both cases, we introduce the following where K= NANI-l)is a normalization constant ensuring Cc(Uk, Pu) falls within [0, 1]. Note that the N, instances contained in Pu identify both 5.2.2.3.CL: Length of the subsequence established between x; positive and negative preferences. So, it is necessary to and uf The length of a sequence is a relevant parameter for introduce the DoI of these instances in the computation deciding whether the relationship between its origin and of the semantic centrality, as shown in Eq.(13). This terminus is meaningful. Bearing in mind this idea, we intro- way, the stronger the relationship between uk and instances duce component CL(UF, Pu) in the computation of very interesting for the user U, the higher the value of com- ASem(ul). The longer the subsequence established between ponent Cc(oK, Pu) uR and x;( denoted by pss"in expression 12), the lower this Once all semantic intensity components have been identi component, because the relationship between both fied, we show in Eq. (14)the final expression of isem(uk), instances is less significant due to the presence of many where the weights W1, W2 and w3 were chosen experimentally intermediate nodes to be w1=w2=0.3, and w3=0. 4. In general, we observed that the best results were obtained when the centrality For simplicity, we do not represent the full hierarchy of topics Fig. l, but only the instances Spanish Ciuil War and WWll belonging te 13 Recall that b(a, b)is the ratio of geodesic property sequence between a the War Topics class in that hierarchy and b in which c is included

On the contrary, this component is low when v i;l k is (hierarchically) related to U’s negative preferences. Example 8. Let us focus on the user profile PU shown in Fig. 2, and on the instance Spanish Civil War (belonging to the War Topics class) in Fig. 1. To decide whether this instance is appealing to U, the component CH(Spanish Civil War, PU) considers the similarity between Spanish Civil War and those instances in the profile PU belonging to its same hierarchy (in this case, the hierarchy of topics12). In this example, as U is not interested in the instance WWII belonging to the hierarchy of topics (because DOIP U (WWII) = 1), AVATAR infers that the Spanish Civil War is not a topic appealing to him/her. To uncover this knowledge, the system uses the value CH (Spanish Civil War, PU) = 1 obtained from Eq. (10). In this case, H = {WWII} and the hierarchical similarity between WWII and Spanish Civil War is 1 because both instances belong to the same class (War Topics in Fig. 1). 5.2.2.2. CI: User interest in xi, origin of the sequence psxi k analyzed. If v i;l k is closely related to an instance the user likes, kSemðv i;l k Þ will be high. This degree of interest is reflected in our proposal by the component CIðv i;l k ; P U Þ. So, the greater the DOI of instance xi (the origin of the analyzed sequence), the higher CIðv i;l k ; P U Þ: CIðv i;l k ; P U Þ ¼ DOI P U ðxiÞ ð11Þ Example 9. Let us recall the profile PU shown in Fig. 2 and the sequence Camaron–Flamenco dancing–Carmen– Spain–Discover Spain! represented in Fig. 1. In this scenario, we compute component CI for two different nodes. So, CIðCarmen; P U Þ ¼ CIðDiscoverSpain!; P U Þ ¼ DOI P U ðCamaronÞ ¼ 0:9. However, the relationships between the movie Camaron (viewed by the user) and the two mentioned programs are not equally significant or direct; to discriminate both cases, we introduce the following component. 5.2.2.3. CL: Length of the subsequence established between xi and v i;l k . The length of a sequence is a relevant parameter for deciding whether the relationship between its origin and terminus is meaningful. Bearing in mind this idea, we introduce component CLðv i;l k ; P U Þ in the computation of kSemðv i;l k Þ. The longer the subsequence established between v i;l k and xi (denoted by pssxi in expression 12), the lower this component, because the relationship between both instances is less significant due to the presence of many intermediate nodes. CLðv i;l k ; P U Þ ¼ 1 LengthðpssxiÞ ð12Þ where pssxi ¼ ½p1; ... ; pl1 and pssxi :PSNodesSequence ðÞ ¼ ½xi; v i;2 k ; ... ; v i;l k in Fig. 4. Example 10. For the sequence introduced in Example 9, we get CLðCarmen; P U Þ ¼ 1 2 > 1 4 ¼ CL (Discover Spain!, PU). Both values allow to detect that the relationship between Carmen and Camaron is more significant than the relationship between the latter movie and the documentary Discover Spain! In other words, the relationship ‘‘movie about the flamenco dancing’’ is more direct than ‘‘movie about the country where flamenco dancing arose’’. 5.2.2.4. CC: Semantic centrality of v i;l k . At the end of Section 5, we said that the stronger the relationship between the user preferences and v i;l k , the greater the value of kSemðv i;l k Þ. To quantify the strength of this relationship, we resort to the notion of semantic centrality, modeled by component CCðv i;l k ; P U Þ. We use the concept of betweenness, because this is well-suited for measuring significant relationships between the instances involved in its computation.13 Taking into account that the relationships on which we focus are established between the instances identified in the pro- file PU and v i;l k , the computed betweenness must involve specifically these instances. Consequently, CCðv i;l k ; P U Þ is computed by adding the betweenness of each instance in PU, computed with respect to v i;l k and the remaining NI 1 instances pointed to from this profile, as shown in Eq. (13). High betweenness values indicate that going from some instances identified in the user profile to v i;l k , it is necessary to cross other instances also contained in U’s preferences. This way, it is possible to identify significant relationships between v i;l k and instances in the profile PU. CCðv i;l k ; P U Þ ¼ 1 K XNI f ;g;f 6¼g bxf ðxg; v i;l k Þ DOI P U ðxf Þ þ DOI P U ðxgÞ 2 ð13Þ where K = NI(NI 1) is a normalization constant ensuring CCðv i;l k ; P U Þ falls within [0, 1]. Note that the NI instances contained in PU identify both positive and negative preferences. So, it is necessary to introduce the DOI of these instances in the computation of the semantic centrality, as shown in Eq. (13). This way, the stronger the relationship between v i;l k and instances very interesting for the user U, the higher the value of component CCðv i;l k ; P U Þ. Once all semantic intensity components have been identi- fied, we show in Eq. (14) the final expression of kSemðv i;l k Þ, where the weights w1,w2 and w3 were chosen experimentally to be w1 = w2 = 0.3, and w3 = 0.4. In general, we observed that the best results were obtained when the centrality 12 For simplicity, we do not represent the full hierarchy of topics in Fig. 1, but only the instances Spanish Civil War and WWII belonging to the War Topics class in that hierarchy. 13 Recall that bc(a,b) is the ratio of geodesic property sequence between a and b in which c is included. 314 Y. Blanco-Ferna´ndez et al. / Knowledge-Based Systems 21 (2008) 305–320

点击下载完整版文档（PDF格式）

共16页，试读结束，阅读完整版请下载

点击下载（PDF格式）

浏览记录