The Journal of Systems and Software 85(2012)87-101 Contents lists available at Science Direct The Journal of Systems and Software ELSEVIER journalhomepagewww.elsevier.com/locate/iss Keyword clustering for user interest profiling refinement within paper recommender systems Xiaoyu Tang, Qingtian Zeng College of information Science and Engineering. Shandong University of Science and Technology, No 579 Qianwangang Road, Qingdao 266510, PR China ARTICLE INFO A BSTRACT To refine user interest profiling, this paper focuses on extending scientific subject ontology via key- word clustering and on improving the accuracy and effectiveness of recommendation of the electronic in online services. A ne 23 July 2011 the purpose of the subject ontology extension. Based on the keyword clusters, the construction of ser interest profiles is presented on a rather fine granularity level. In the construction of user inter- est profiles, we apply two types of interest profiles: explicit profiles and implicit profiles. The explicit profiles are obtained by relating users' interest-topic relevance factors to users'interest measurements ser interest profiles on the basis of the correlative relationships among the topic nodes in topic network graphs. Three recommender systems experiments are conducted which reveal that the uses of the subject ontology extension approach as Ontology extension vell as the two types of interest profiles satisfyingly contribute to an improvement in the accuracy of Crown Copyright o 2011 Published by Elsevier Inc. All rights reserved. 1. Introduction will probably agree There are also case-based approache like CASPER (Smyth et al., 2002). Entree system(Burke et al, 1997) Recommender systems form or work from a specific type of and some other profiling methods based on artificial neutral net formation filtering system technique that attempts to recom- work, such as Tan et al. (1998), Shepherd et al. (2002)and Kim et al end information items that are likely to be of interest to the (2002). user. Typically, a recommender system compares a user profile to he utilization of ontologies in user profiling techniques has certain reference characteristics and seeks to predict the 'rating gained much attention since it allows inference to be employed that a user would give to an item they had not yet considered enabling interests to be discovered that were not directly observed (Wikipedia, 2010). The current generation of recommendation in the user's behavior (Middleton et al, 2004: Zeng et al, 2009: methods is usually classified into the following three main cate- Wu et al, 2009). Additionally, once profiles are represented using gories: content-based, collaborative, and hybrid recommendation an ontology, they can communicate with other ontologies which pproaches(Adomavicius and tuzhilin, 2005 ). Among the ma share similar concepts, which contributes to knowledge reuse and techniques of recommender systems, user profiling forms the basis abates the effect of the cold-start problem(Felden and Linden, of such systems. At the same time, much work has been done in 2007: Middleton et al., 2004). In the Foxtrot system(Middleton academia and industry on creating user profiles since the emer- et al., 2004). researchers improved recommendations based on gence of recommender systems. For example, VSM-based profiling the ontological profiling method by visualizing the profiles them pproaches are a basic way of conducting user interest modeling selves and presenting them to users. blanco-Fernandez et al. (2008 (Balabanovic and Shoham, 1997). In GroupLens, profiles of all users used a domain ontology in which the semantic descriptions of the are aggregated in a user-item rating matrix, in which each line cor- available items(e.g TV programs, books, CDs, etc )are formalized responds to a user, and each column to an item(Resnick et al. Their reasoning-based recommendation strategy discovers seman- 1994). Collaborative recommender systems produce recommen- tic relationships between the users' preferences and the items dations based on the heuristic that people who agreed in the past available in the domain ontology. These relationships provide the system with extra knowledge about the users'interests, thus favor In this paper, we propose a refined ontological profiling metho es:lonewar@163.com(xTang).qtzeng@163.com. based on our extended subject ontology. Two kinds of interest 1.cn,qingtianzeng@hotmail.com(QZeng). profiles are introduced: explicit profiles and implicit profiles. The 0164-1212/s-see front matter. Crown Copyright o 2011 Published by Elsevier Inc. All rights reserved. i:10.1016Js201107029
The Journal of Systems and Software 85 (2012) 87–101 Contents lists available at ScienceDirect The Journal of Systems and Software journal homepage: www.elsevier.com/locate/jss Keyword clustering for user interest profiling refinement within paper recommender systems Xiaoyu Tang, Qingtian Zeng∗ College of Information Science and Engineering, Shandong University of Science and Technology, No. 579 Qianwangang Road, Qingdao 266510, PR China article info Article history: Received 5 October 2010 Received in revised form 8 July 2011 Accepted 13 July 2011 Available online 23 July 2011 Keywords: Weighted keyword graph Keyword clustering User interest profiles Recommender systems Ontology extension abstract To refine user interest profiling, this paper focuses on extending scientific subject ontology via keyword clustering and on improving the accuracy and effectiveness of recommendation of the electronic academic publications in online services. A clustering approach is proposed for domain keywords for the purpose of the subject ontology extension. Based on the keyword clusters, the construction of user interest profiles is presented on a rather fine granularity level. In the construction of user interest profiles, we apply two types of interest profiles: explicit profiles and implicit profiles. The explicit profiles are obtained by relating users’ interest-topic relevance factors to users’ interest measurements of these topics computed by a conventional ontology-based method, and the implicit profiles are acquired on the basis of the correlative relationships among the topic nodes in topic network graphs. Three experiments are conducted which reveal that the uses of the subject ontology extension approach as well as the two types of interest profiles satisfyingly contribute to an improvement in the accuracy of recommendation. Crown Copyright © 2011 Published by Elsevier Inc. All rights reserved. 1. Introduction Recommender systems form or work from a specific type of information filtering system technique that attempts to recommend information items that are likely to be of interest to the user. Typically, a recommender system compares a user profile to certain reference characteristics and seeks to predict the ‘rating’ that a user would give to an item they had not yet considered (Wikipedia, 2010). The current generation of recommendation methods is usually classified into the following three main categories: content-based, collaborative, and hybrid recommendation approaches (Adomavicius and Tuzhilin, 2005). Among the main techniques of recommender systems, user profiling forms the basis of such systems. At the same time, much work has been done in academia and industry on creating user profiles since the emergence of recommender systems. For example, VSM-based profiling approaches are a basic way of conducting user interest modeling (Balabanovic and Shoham, 1997 ´ ). In GroupLens, profiles of all users are aggregated in a user-item rating matrix, in which each line corresponds to a user, and each column to an item (Resnick et al., 1994). Collaborative recommender systems produce recommendations based on the heuristic that people who agreed in the past ∗ Corresponding author. E-mail addresses: lonewar@163.com (X. Tang), qtzeng@163.com, qtzeng@sdust.edu.cn, qingtianzeng@hotmail.com (Q. Zeng). will probably agree again. There are also case-based approaches like CASPER (Smyth et al., 2002), Entrée system (Burke et al., 1997), and some other profiling methods based on artificial neutral network, such as Tan et al. (1998), Shepherd et al. (2002) and Kim et al. (2002). The utilization of ontologies in user profiling techniques has gained much attention since it allows inference to be employed, enabling interests to be discovered that were not directly observed in the user’s behavior (Middleton et al., 2004; Zeng et al., 2009; Wu et al., 2009). Additionally, once profiles are represented using an ontology, they can communicate with other ontologies which share similar concepts, which contributes to knowledge reuse and abates the effect of the cold-start problem (Felden and Linden, 2007; Middleton et al., 2004). In the Foxtrot system (Middleton et al., 2004), researchers improved recommendations based on the ontological profiling method by visualizing the profiles themselves and presenting them to users. Blanco-Fernández et al. (2008) used a domain ontology in which the semantic descriptions of the available items (e.g., TV programs, books, CDs, etc.) are formalized. Their reasoning-based recommendation strategy discovers semantic relationships between the users’ preferences and the items available in the domain ontology. These relationships provide the system with extra knowledge about the users’ interests, thus favoring more accurate personalization processes. In this paper, we propose a refined ontological profiling method based on our extended subject ontology. Two kinds of interest profiles are introduced: explicit profiles and implicit profiles. The 0164-1212/$ – see front matter. Crown Copyright © 2011 Published by Elsevier Inc. All rights reserved. doi:10.1016/j.jss.2011.07.029
8 in the existing ontologies mavicius and Tuzhilin, 2005; Correa da Silva et al., 2002) rofiling methods based on them (Middleton ork for generating nmender systems. wse, download mernoassnould ugh the user intertace at a of the research papers are
88 X. Tang, Q. Zeng / The Journal of Systems and Software 85 (2012) 87–101 construction algorithms for profiling explicit interests and for pro- filing implicit interests are also proposed. Based on our extended subject ontology, we take into account the pertinence of keywords and the classifications of the ontology in the calculation of explicit interest profiles. The method for inferring users’ potential interests (implicit interests) proposed employs a quantified evaluation of relationships between classifications in the ontology. The paper is organized into seven sections. Section 2 discusses the related work. Section 3 presents a brief description of our recommender system, based on which we evaluate our approach. In Section 4, we propose a way of clustering keywords for an existing ontology, in order to create a more detailed and accurate structure of the existing ontology. In Section5, the approaches to user interest profiles, which contain both explicit and implicit interest profiles, are presented. Section 6 introduces our prototype system called SPRS, and using this system we conducted three experiments. We evaluate the subject ontology extension method and refinement of user profiling approach. Section 7 concludes the paper and discusses future work. 2. Related work Ontology is a conceptualization of a domain into a humanunderstandable butmachine-readable format consisting of entities, attributes, relationships, and axioms (Guarino and Giaretta, 1995). It is used to alleviate the communication problems between systems due to ambiguous usage of different terms. For building user profiles, ontologies are used to address the so-called “cold-start problem” (Middleton et al., 2002; Susan and Gauch, 2004).Cantador et al. (2008) mapped social tagging information from multiple sources to their ontological structures that described the domains of interest covered by the tags, in order to build user profiles. Moreover, Middleton et al. (2004) used a term ontology to refer to the classification structure and instances within a knowledge base, representing the profiles in terms of a research paper topic ontology, which allows other interests to be inferred that go beyond those only seen in directly observed behavior. They also employ pro- file visualization to acquire profile feedback from users to improve profiling accuracy. A number of strategies have been implemented to facilitate the construction of ontology. For example, Mika (2007) extended the traditional bipartite model of ontologies with the social dimension, leading to a tripartite model of actors, concepts and instances. He also demonstrated the application of this representation by showing how community-based semantics emerges from this model through a process of graph transformation. In addition, Zhang et al. (2010) proposed a suite of ontology metrics, at both the ontologylevel and the class-level, to measure the design complexity of ontologies. Finally, in recent years complex network and other forms of network models such as bipartite network have been considered to be important aspects of recommender systems by many researchers (Zanin et al., 2008; Zhou et al., 2007, 2010). When focusing on the problem of recommending items to a user, the underlying transaction data can be seen as a bipartite network, in which users and items are represented as two groups of nodes, connected to each other by certain links (Zanin et al., 2008). In order to utilize the bipartite network, a one-mode projecting method is usually implemented as an alternative to using the bipartite network directly. Zhou et al. (2007) raised a novel one-mode projecting method to compress the bipartite network and better preserve the original information. Zhou et al. (2010) introduced and used evaluation criteria for the diversification of recommendation, aside from accuracy. They believe the next generation of information filtering methods should focus on not only precision but also diversification, and that a balance between them should be sought. There are some problems in the existing ontologies (Adomavicius and Tuzhilin, 2005; Correa da Silva et al., 2002) and the user interest profiling methods based on them (Middleton et al., 2001, 2003, 2004; Cantador et al., 2008; Susan and Gauch, 2004; Felden and Linden, 2007). (1) Most of the ontologies in use are framed in coarse granularity, making the classification of items obscure and undetermined. Therefore, the effectiveness of user profiling techniques based on such ontologies, no matter how sophisticated the interest profiling algorithms are, would deteriorate because of the coarse classification. (2) Typically, ontologies are usually predefined manually and remain fixed during a certain period of time, which makes them insensitive to new changes. When new research subjects emerge or other changes happen, the modifications must be done by their creators manually. This reaction is slow and tardy and needs human involvement. Moreover, typical ontology-based user interest calculation algorithms compute the interest value on each category for every user; however, these methods are not always effective. For instance, assume that two users, user A and user B have both viewed a certain number of papers in the subject “artificial intelligence” and we obtain the same interest value for their interests in this subject; therefore we believe their interests in “artificial intelligence” are no different. However, in fact some papers viewed by user A are about “support vector machine”, and the other papers viewed by this user are about “genetic algorithm”, whereas the papers viewed by user B are all related to “natural language processing”. In this case, the relatively subtle differences between the users’ interests in “artificial intelligence” are not discovered, leading to the description of user interests and the subsequent recommendation being inaccurate. (3) It is difficult for conventional approaches to differentiate the items within the same class (or subject). That is, in the case of research paper recommendation, different papers in the same subject contribute essentially equally to the construction of user profiles and the differences in textual information between papers is neglected, which is evidently unreasonable. (4) The inference of topics of interest via ontological relations between topics that have not been browsed explicitly by users is not precise enough. For instance, assume that there are three subclasses A, B and C under their immediate super class A, and a user has an interest in subclass A with an interest value of 0.4. With the aforementioned conventional profiling method, we get 0.2 for the user’s interest value of the super class A. In this scenario, the user will receive equivalent recommendations from subclass B and C, because the conventional profiling method does not take into account the differences between subclass B and C, leading to inaccuracy of the inference about a user’s implicit interest because these differences may be relevant to the user’s preferences. 3. Framework for generating user interest profiles In this section, we first present the framework for generating user interest profiles within online paper recommender systems, which is shown in Fig. 1. The main components in the framework include: (1) Subject ontology. The subject ontology is predefined by domain experts with suitable granularity and scale, which presents the organization structure of domain knowledge and serves as the taxonomy for research papers. Moreover, it is the basis of the user profile. To refine the subject ontology, we will discuss our method of extending the subject ontology through automatically clustering weighted keyword graph in Section 4. (2) Paper management module. Users can upload, browse, download and comment on any research papers through the user interface of the paper management module. All of the research papers are
X Tang Q Zeng/ The Journal of Systems and Software 85(2012)87-101 日 per Management Module Pa User 1 te rface 日 Subject User Behavior Monitoring E Subject2.1.1 Module User n User ProfilIng Module Behavi Datab ser Profiles Fig. 1. Framework for generating user interest profiles. stored in the paper database. Each paper in the paper database In the paper database storing the research paper da is classified according to the subject ontology and can readily each papers information contains its corresponding category be retrieved by users. The paper management module plays the mark based on the subject ontology. In this paper, vector role of fundamental component in the whole framework. space model is used to represent the research papers. Key (3)User behavior monitoring module This module is responsible for words are provided by the papers authors, representing the the background collection of the behavior data of each user. key content of each corresponding paper. In the vector space The user behavior data include searching keywords, browsing, model, a paper is represented by a keyword vector, i.e., downloading and commenting on papers, etc. The monitoring paper=(keywordl,., keywor keyword)(1≤i≤n).The nd collecting processes are totally unobtrusive. weight of the ith keyword keyword in paper is (4)User profiling module. The user profiling module makes use of as WKP(keyword, paper), which is computed by the tE the user behavior data recorded by the user behavior moni- oring module the paper database and the subject to create user profiles. The user profiles obtained ca to recommend papers to these users. We will elaborate our COMPUTER SICENCI profiling approach in Section 6. 4. Automatic ontology extension through clustering Communications and Theory of computation Algorithms and data structures Databases In order to solve the problems in the user profiles based on the traditional ontologies, we pi ontology extension algorithm Programming languages and to refine the user profiles. Before presenting it, we introduce an Artificial intelligence com pliers riginal subject ontology, which is defined by the Science Paper Concurrent, paralleL, and Online website(Sciencepaper Online, 2010). It is a taxonomy of distributed systems Computer graphics research subjects and has been in use on the Internet for many ears. This simple ontology consists of two levels of classification, Software engineering Scientific computing primary subjects and secondary subjects, and it holds is-a relation- ships between the subjects in different levels. In the first level, there Computer architecture Collaborative Networks are 43 primary subjects. Each primary subject has secondary sub- jects as its subordinate classifications. Fig. 2 shows the section of Fig. 2. The classification of"computer science"in the research paper subject onto- the primary subject"computer science "in this subject ontology
X. Tang, Q. Zeng / The Journal of Systems and Software 85 (2012) 87–101 89 Fig. 1. Framework for generating user interest profiles. stored in the paper database. Each paper in the paper database is classified according to the subject ontology and can readily be retrieved by users. The paper management module plays the role of fundamental component in the whole framework. (3) User behavior monitoring module. This module is responsible for the background collection of the behavior data of each user. The user behavior data include searching keywords, browsing, downloading and commenting on papers, etc. The monitoring and collecting processes are totally unobtrusive. (4) User profiling module. The user profiling module makes use of the user behavior data recorded by the user behavior monitoring module, the paper database and the subject ontology, to create user profiles. The user profiles obtained can be used to recommend papers to these users. We will elaborate our profiling approach in Section 6. 4. Automatic ontology extension through clustering weighted keyword graphs In order to solve the problems in the user profiles based on the traditional ontologies, we propose an ontology extension algorithm to refine the user profiles. Before presenting it, we introduce an original subject ontology, which is defined by the Science Paper Online website (Sciencepaper Online, 2010). It is a taxonomy of research subjects and has been in use on the Internet for many years. This simple ontology consists of two levels of classification, primary subjects and secondary subjects, and it holds is–a relationships between the subjects in different levels. In the first level, there are 43 primary subjects. Each primary subject has secondary subjects as its subordinate classifications. Fig. 2 shows the section of the primary subject “computer science” in this subject ontology. In the paper database storing the research paper data, each paper’s information contains its corresponding category mark based on the subject ontology. In this paper, vector space model is used to represent the research papers. Keywords are provided by the paper’s authors, representing the key content of each corresponding paper. In the vector space model, a paper is represented by a keyword vector, i.e., paper = (keyword1,..., keywordi,..., keywordn) (1 ≤ i ≤ n). The weight of the ith keyword keywordi in paperp is denoted as WKP(keywordi, paperp), which is computed by the TF-IDF Fig. 2. The classification of “computer science” in the research paper subject ontology
X Tang, Q Zeng/The Joumal of Systems and Software 85(2012)87-101 algorithm (Term Frequency-Inverse Document Frequency)(Salton and Buckley, 1988): here ps(keyword) is the set of papers containing keyword WKP(keyword, paper)is the weight of keyword in PKV (2)The set of edges CRC(KNS x KNS) represents the relations between different keyword nodes in KNS 1)WKP(keyword, papers)is the weight of the keyword weight of R where Ri is the edge linking keyword;and erp)is the keyword: Ps(keyword) is the paper set containing keyword, keyword in paper: (3)N indicates the the papers; (4)nk is the frequency of the keyword the function count() counts the number of papers of which the keyword vectors contain both keyword; and keyword. Le, PS(keyword)n PS(keyword) In the following discussions, we use PKVi (WKP(keyword, paperi),., WKP(keyword, paper i),..., WKP In the paper-keyword relation model (PKRM), if any two key (keywords, papern )) to represent paperi in which the quantitative quantitative word nodes in KNS are connected to one or more paper nodes in value of keyword is computed by the TF-IDF algorithm( Salton and PNS, then there is an edge connecting the two keyword nodes in uckley, 1988). By computing the TF-IDF values of keywords we CR of WKG WGT(Rii)indicates the number of co-occurrences of the obtain the keyword vectors of all papers. two keywords in papers. During the subsequent keyword clusterin The subject ontology we have discussed so far is still coarse process WGT(Rij )are changed granular and thus not precise enough to use. In order to produce For example, a weighted keyword graph of the subjectartifi a more finely granular ontology, we now introduce an automatic cial intelligence", which is created with Pajek( Batagelj and Mrvar, clustering algorithm for weighted keyword graph, which includes 1998), is illustrated in Fig 4. The keywords engaged are from 200 hree main steps, as follows papers in the subject of"artificial intelligence". The construction First, we define the paper-keyword relation model represented of the weighted keyword graph depends on the paper set. As the by a bipartite graph. A bipartite graph is a graph whose vertices paper set is changed, so the result may be different. can be divided into two disjoint sets U and V such that every edge nally, we cluster the weighted keyword graph to extend the connects a vertex in U to one in V: that is, U and V are independent subject ontology The keyword clustering is virtually the process of sets(West, 2000).Usually the authors specify a few keywords for increasing the edge weight threshold and finding a practical one. In their papers in order to demonstrate the contents or topic of the Fig 4, we can clearly see that keywords related to similar themes tend to cluster together. These keyword bundles naturally emerge papers to create the bipartite graphs between papers and keywords. and signify the tendency towards a concentration of the contents Here, we introduce the definition of the paper-keyword relation among similar papers. Through the keyword clustering process. these concentrated keyword bundles will be decomposed into sep Definition 1(Paper-keyword relation model). A paper-keyword arate ones. Now we present the keyword clustering procedure relation model is a bipartite graph PKRM=(PNS, KNS, RS), where (1)According to Definition 2, the weighted keyword graph WKG of (1)The set of nodes PNS= paper1, paper, papern represents a set of papers. (2)Define a threshold S2 for the weights of edges in WKG. If (2)The set of nodes KNS=(keywordl, keyword,. keywordmI WGT(Rij)<s2, the edge linking keyword and keyword; would t of keywords. be removed from WKG, where keyword keyword; e KNs and ()The set of edges RSS(PNS x KNS)U(KNS x PNS) indicates the binary relations between PNS and KNS. Suppose paper i e PNs (3)A set of connected components are produced as a result of Step keyword; e KNS and keyword; exists in PKVi, then ri E RS, where 2), with each connected component indicating one keyword rij is an edge connecting the node keyword and the node paper cluster. These keyword clusters are named"topics".We define a topic graph set TGS={TG1,……,TG,……,TGn} in which TGi rep- resents a topic, to represent all the topics generated An example of a paper-keyword relation model is shown in Fig. 3, in which the circle nodes represent research papers and the rectangle nodes represent keywords. The edges stand for the Here, we make three rules to facilitate the application of key relations between papers and keywords. word clustering. Secondly using the one-mode projection method(zanin et al 2008: Zhouet al,2007).we project paper-keyword relation models (1)Only when there are at least four keyword nodes in a cluster onto the keywords set KNS to generate weighted keyword graphs. is this keyword cluster considered to be a topic: the nodes in The one-mode projection onto keywords means a netw the clusters with keywords less than four, together with the taining only keyword nodes, in which two keyword nodes are keyword nodes with no connections, are collected into a special connected when they have at least one common neighboring paper topic, called"heterogeneous"topic. The "heterogeneous"topic node(zhou et al, 2007). In the following, we give the definition of is not included in user profiles, and therefore in the rest of this weighted keyword graphs paper the term "topic"only refers to usual keyword clusters that are generated by the keyword clustering process. Definition 2(Weighted keyword graph). A weighted keyword (2)We name the topics with the keywords which possess the graph is a graph WKG=(KNS, CR), where largest value of the sum of the weights of edges connected to them. If there is more than one keyword node with the highest (1)KNS is the node set representing keywords. For each connectivity, we choose one from them randomly keyword; E KNS. the weight of keyword; is calculated by (3)In the case that names of the clusters (topics) are the same s their immediate upper subjects' names, we would not con- Weight(keywords) WKP(keyword, paper) sider these kinds of clusters to be topics; that is, the keyword node with the highest connectivity in each cluster and the edges paperpE PS(keyword) linking to it are virtually neglected
90 X. Tang, Q. Zeng / The Journal of Systems and Software 85 (2012) 87–101 algorithm (Term Frequency-Inverse Document Frequency) (Salton and Buckley, 1988): WKP(keywordi, paperp) = tf(keywordi, paperp) · log((N/ni) + 0.01) n k=1 (tf(keywordk, paperp) · log((N/nk) + 0.01))2 , where (1) WKP(keywordi, paperp) is the weight of the keyword keywordi in paperp; (2) tf(keywordi, paperp) is the frequency of the keyword keywordi in paperp; (3) N indicates the total number of the papers; (4) nk is the frequency of the keyword keywordk in all papers. In the following discussions, we use PKVi = (WKP(keyword1, paperi), . . . ,WKP(keywordj, paperi), . . . ,WKP (keywordn, paperi)) to represent paperi, in which the quantitative value of keywordj is computed by the TF-IDF algorithm (Salton and Buckley, 1988). By computing the TF-IDF values of keywords we obtain the keyword vectors of all papers. The subject ontology we have discussed so far is still coarsely granular and thus not precise enough to use. In order to produce a more finely granular ontology, we now introduce an automatic clustering algorithm for weighted keyword graph, which includes three main steps, as follows. First, we define the paper–keyword relation model represented by a bipartite graph. A bipartite graph is a graph whose vertices can be divided into two disjoint sets U and V such that every edge connects a vertex in U to one in V; that is, U and V are independent sets (West, 2000). Usually the authors specify a few keywords for their papers in order to demonstrate the contents or topic of the papers. We collect the keywords and record their relations with papers to create the bipartite graphs between papers and keywords. Here, we introduce the definition of the paper–keyword relation model: Definition 1 (Paper–keyword relation model). A paper–keyword relation model is a bipartite graph PKRM = (PNS, KNS, RS), where (1) The set of nodes PNS = {paper1, paper2, ..., papern} represents a set of papers. (2) The set of nodes KNS = {keyword1, keyword2, ..., keywordm} represents a set of keywords. (3) The set of edges RS ⊆ (PNS × KNS)∪(KNS × PNS) indicates the binary relations between PNS and KNS. Suppose paperi ∈ PNS, keywordj ∈ KNS and keywordj exists in PKVi, then ri,j ∈ RS, where ri,j is an edge connecting the node keywordj and the node paperi. An example of a paper–keyword relation model is shown in Fig. 3, in which the circle nodes represent research papers and the rectangle nodes represent keywords. The edges stand for the relations between papers and keywords. Secondly, using the one-mode projection method (Zanin et al., 2008; Zhou et al., 2007), we project paper–keyword relationmodels onto the keywords set KNS to generate weighted keyword graphs. The one-mode projection onto keywords means a network containing only keyword nodes, in which two keyword nodes are connected when they have at least one common neighboring paper node (Zhou et al., 2007). In the following, we give the definition of weighted keyword graphs. Definition 2 (Weighted keyword graph). A weighted keyword graph is a graph WKG= (KNS, CR), where (1) KNS is the node set representing keywords. For each keywordi ∈ KNS, the weight of keywordi is calculated by Weight(keywordi) = paperp ∈ PS(keywordi) WKP(keywordi, paperp), where PS(keywordi) is the set of papers containing keywordi; WKP(keywordi, paperp) is the weight of keywordi in PKVp. (2) The set of edges CR ⊆ (KNS × KNS) represents the relations between different keyword nodes in KNS. (3) WGT(Ri,j) = count(PS(keywordi) ∩ PS(keywordj)) denotes the weight of Ri,j, where Ri,j is the edge linking keywordi and keywordj; PS(keywordi) is the paper set containing keywordi; the function count( ) counts the number of papers of which the keyword vectors contain both keywordi and keywordj, i.e., PS(keywordi) ∩ PS(keywordj). In the paper–keyword relation model (PKRM), if any two keyword nodes in KNS are connected to one or more paper nodes in PNS, then there is an edge connecting the two keyword nodes in CR of WKG. WGT(Ri,j) indicates the number of co-occurrences of the two keywords in papers. During the subsequent keyword clustering process WGT(Ri,j) are changed. For example, a weighted keyword graph of the subject “artifi- cial intelligence”, which is created with Pajek (Batagelj and Mrvar, 1998), is illustrated in Fig. 4. The keywords engaged are from 200 papers in the subject of “artificial intelligence”. The construction of the weighted keyword graph depends on the paper set. As the paper set is changed, so the result may be different. Finally, we cluster the weighted keyword graph to extend the subject ontology. The keyword clustering is virtually the process of increasing the edge weight threshold and finding a practical one. In Fig. 4, we can clearly see that keywords related to similar themes tend to cluster together. These keyword bundles naturally emerge and signify the tendency towards a concentration of the contents among similar papers. Through the keyword clustering process, these concentrated keyword bundles will be decomposed into separate ones. Now we present the keyword clustering procedure: (1) According to Definition 2, the weighted keyword graph WKG of a subject is generated. (2) Define a threshold ˝ for the weights of edges in WKG. If WGT(Ri,j) < ˝, the edge linking keywordi and keywordj would be removed from WKG, where keywordi, keywordj ∈ KNS and i =/ j. (3) A set of connected components are produced as a result of Step (2), with each connected component indicating one keyword cluster. These keyword clusters are named “topics”. We define a topic graph set TGS = {TG1, ..., TGi, ..., TGn} in which TGi represents a topic, to represent all the topics generated. Here, we make three rules to facilitate the application of keyword clustering. (1) Only when there are at least four keyword nodes in a cluster is this keyword cluster considered to be a topic; the nodes in the clusters with keywords less than four, together with the keyword nodes with no connections, are collected into a special topic, called “heterogeneous” topic. The “heterogeneous” topic is not included in user profiles, and therefore in the rest of this paper the term “topic” only refers to usual keyword clusters that are generated by the keyword clustering process. (2) We name the topics with the keywords which possess the largest value of the sum of the weights of edges connected to them. If there is more than one keyword node with the highest connectivity, we choose one from them randomly. (3) In the case that names of the clusters (topics) are the same as their immediate upper subjects’ names, we would not consider these kinds of clusters to be topics; that is, the keyword node with the highest connectivity in each cluster and the edges linking to it are virtually neglected.
Tang Q Zeng/The journal of systems and Software 85(2012)87-101 recommender systems ting estimation method Mender systems social tagging Recommendatin system Content-based filtering Profile Injection Attacks Shiling The papers represented by circles are 1. Towards the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions 2. Evaluating Collaborative Filtering Recommender Systems 3. Ontological User Profiling in Recommender Systems 4. Enriching Ontological User Profiles with Tagging History for Multi-Domain Recommendations 5. Individual and Group Behavior-based Customer Profile Model for Personalized Product Recommendation Towards Trustworthy Recommender Systems: An Analysis of Attack Models and Algorithm Robustness Fig 3. An example of a paper-keyword relation modeL. F: \LAB JOKKS\Pajek Network lode1AI!(C权重)met(72) □a LayoutGrsplDnly Frevieas Redrwe sext ZoomOut Options Export Move Info uncertan reasoning structue exd ctMe component at danaus ama movshon per tune fhie ecog deta acquision hmodal co 2y matching Fig 4. A weighted keyword graph of the subject "artificial intelligence"demonstrated by Pajek
X. Tang, Q. Zeng / The Journal of Systems and Software 85 (2012) 87–101 91 Fig. 3. An example of a paper–keyword relation model. Fig. 4. A weighted keyword graph of the subject “artificial intelligence” demonstrated by Pajek
X Tang, Q Zeng/The Joumal of Systems and Software 85(2012)87-101 82.F: 12len(无权重).net(3 ble touton men marooner are spicate temsimuirkont m patten theorem mage compreinte moble not man body. ever function°ay°a e wsidncrete wavelet transfomer elements d nevapeech emoton recogretionseindour n是pnd· conceplnet ladde chart● ef reco focused web clawing technclopar mage de-ndimwen ° voice recognior pust20°swmm是sdho°dsde° negatve. instaneous, setion on e executive tock ordona Didam de ● very greed ide&knapsack pn Paticle swam optimization9eham°shis· compositon dingus如 man visus systeme fao●pmgm是 the gorhen· nerve baric ●mcas· dala wenshou. stud dalaba是tma· atficin emotion modeition iverneb data mirPcemoonerh &relata·o●eod·e non-hneatly doey° contex-awa'·osd° service· colocation● fequency● ame recod· oopleyng levereduction mode puring methcdreedomiaion m speech emoti@teating methodweless senao anyang metfcompenion image selectioapeech emotortasb's caca modfied binomial ° data ming° decsion tee°m· waelet transfodata单mc01a· trng kemet° dynamic prog9mth是 x kene● vaal navision and orienta2是 maren w reverse eroneenng amann Fig. 5. A weighted keyword graph of the subject"artificial intelligence"(edge weight threshold equals 2). If we assign a different value to the threshold s2, different e The method of ontology extension through keyword clustering gences of TGS then will be produced During our experiment, in holds two advantages. It allows subject ontology to be automati which 200 papers were used, we discovered that, when 1 was used cally and sensitively adaptive to the changes of research topics in as the threshold of edge weight, TGS is as shown in Fig 4: when any subject. However scientific research topics change, whatever 3 was used as the threshold of edge weight, most of the keyword new research hotspots appear or whichever old research contents nodes in TGS became extremely discrete and only three keyword fade out, for example, the changes and the newest status will be clusters, within which very few keyword nodes existed maintained. immediately reflected in the new formation as the keyword cluster- The selection of the edge weight threshold is strongly pertinent to ing is executed. The ontology extension also makes the user interest ve use 2 as the threshold when clustering the weighted keyword are able to place a new angle of view with more accurate classifica- graph in Fig 4 and the result is showed in Fig. 5. In this weighted tions on all subjects: in this way, users' interests will be captured keyword graph, aside from the"heterogeneous"topic, there are ten and recorded even more clearly. In the next section, we will discuss other newly emerged topics which are listed in Table 1. Therefore the construction of user interest profiles the subordinate classifications of the secondary subjectartificial intelligence"are engendered. 5. Approaches to generating user profiles based on the So far, we have finished the extension of the secondary subject extended subject ontology artificial intelligence. All the secondary subjects in the original subject ontology are extended by the keyword clustering pro- We now use the extended subject ontology to create user pro- cess. An illustrative sketch of the result of the ontology extension files. In this section, we present our refined ontological profiling through keyword clustering is provided in Fig. 6. approach to solve the problems of the low accuracy of recommen- dation and of the coarse granularity of user interest profiles Topics of the secondary subject"artificial intelligence traditional ontology-based profiling algorithms. This user interest Topic nam Connectivity of central profiling approach is able to distinguish between the different con- keyword node tributions of the papers on the same topicto the construction ofuse neural network interest profiles. Also, apart from the user profile obtained directly rom the user behavior data, which we call the" explicit interest rofile, we apply implicit profiles to infer possible interests that users may develop in the future, in order to describe user interests expert system more roundly and thereby improve recommendation. A user inter est profile therefore consists of two parts: an explicit interest profile ansform and an implicit interest profile. An arbitrary user has an explicit Inversion set interest in a certain topic if the user directly accesses one or more traveling salesman problem papers in the topic. By contrast, an arbitrary users implicit interest
92 X. Tang, Q. Zeng / The Journal of Systems and Software 85 (2012) 87–101 Fig. 5. A weighted keyword graph of the subject “artificial intelligence” (edge weight threshold equals 2). If we assign a different value to the threshold ˝, different emergences of TGS then will be produced. During our experiment, in which 200 papers were used, we discovered that, when 1 was used as the threshold of edge weight, TGS is as shown in Fig. 4; when 3 was used as the threshold of edge weight, most of the keyword nodes in TGS became extremely discrete and only three keyword clusters, within which very few keyword nodes existed maintained. The selection of the edge weight threshold is strongly pertinent to the content and the quantity of the papers used in this process. Now we use 2 as the threshold when clustering the weighted keyword graph in Fig. 4 and the result is showed in Fig. 5. In this weighted keyword graph, aside from the “heterogeneous” topic, there are ten other newly emerged topics which are listed in Table 1. Therefore, the subordinate classifications of the secondary subject “artificial intelligence” are engendered. So far, we have finished the extension of the secondary subject “artificial intelligence”. All the secondary subjects in the original subject ontology are extended by the keyword clustering process. An illustrative sketch of the result of the ontology extension through keyword clustering is provided in Fig. 6. Table 1 Topics of the secondary subject “artificial intelligence”. Topic name Connectivity of central keyword node neural network 23 agent 20 SVM 16 genetic algorithms 14 expert system 11 hybrid attributes 10 ontology 9 wavelet transform 8 inversion set 6 traveling salesman problem 5 The method of ontology extension through keyword clustering holds two advantages. It allows subject ontology to be automatically and sensitively adaptive to the changes of research topics in any subject. However scientific research topics change, whatever new research hotspots appear or whichever old research contents fade out, for example, the changes and the newest status will be immediately reflected in the new formation as the keyword clustering is executed. The ontology extension also makes the user interest profiles more precise and distinct. With the clustering method, we are able to place a new angle of view with more accurate classifications on all subjects; in this way, users’ interests will be captured and recorded even more clearly. In the next section, we will discuss the construction of user interest profiles. 5. Approaches to generating user profiles based on the extended subject ontology We now use the extended subject ontology to create user pro- files. In this section, we present our refined ontological profiling approach to solve the problems of the low accuracy of recommendation and of the coarse granularity of user interest profiles in traditional ontology-based profiling algorithms. This user interest profiling approach is able to distinguish between the different contributions of the papers on the same topic to the construction of user interest profiles. Also, apart from the user profile obtained directly from the user behavior data, which we call the “explicit interest” profile, we apply implicit profiles to infer possible interests that users may develop in the future, in order to describe user interests more roundly and thereby improve recommendation. A user interest profile therefore consists of two parts: an explicit interest profile and an implicit interest profile. An arbitrary user has an explicit interest in a certain topic if the user directly accesses one or more papers in the topic. By contrast, an arbitrary user’s implicit interest
Tang Q Zeng/The journal of Systems and Software 85 (2012)87-101 COMPUTER SICENCE COMPUTER SICENCE Algorithms and data Algorithms and data tructures neural network Artificial intelligence Fig. 6. The result of the ontology extension. refers to the users interest in a certain topic about which no papers both keyword and keyword. where paper E UPS(user. have yet been directly accessed by the user. Through the ontolog UPS(user) represents the set of papers recorde ical relationships between this topic and the other topics in which behavior history data; keyword, keyword E UTG(topic usern s he user has explicit interests, the users implicit interest about the weights of the edges are defined as the number of related nis"uninteresting"topic can be predicted Implicit interest pro- files express users' potential interests in topics that are obscurely (4) The weight of node keyword in UtG which is referred as (keyword) measuring the user's interest in the content Definition 3(User interest profile). A user interest profile consists related to keyword; is calculated by of two parts: an explicit interest profile part and an implicit interest profile part, denoted by UEIPUseru)and UlIPUseTu), respectively Each of the two parts is a set of 2-tuples: UEIP(Useru)-( topic. EI( IK(keyword, )=> (WKP( eyword, paper).UIP(paper, user). topic, user ))ltopicr is a topic, El( topic, user)measures User: explicit interest in topic): and UIIPUseru=[( topick, ll( topick, where keyword; E KNS: PS(keyword. seru))ltopick is a topic, Il( topick, user)measures Useru's implicit set of papers con- interest in topick) are ups use s u): UPS(user )is the 5. 1. The profiling method for users' explicit interests WKP(keyword, paper)is the weight of keyword in the vector space model of paper And We now introduce the construction of the explicit interest pre file. The generation of explicit interest profiles is achieved through UIP(paper, user)=-BF(paper the user profiling module in Fig. 1. Fig. 7 shows a detailed process DP( paper,lse%,Φe(0,+∞) i> The key task of creating the explicit interest profiles of users lies number of times user browses paper, or the number of timese he chart of this module where BF(paper user)denotes behavior factor which equal nave explicit interest. Considering a paper's possible relevance to depending on the type of behavior based on whhs divided by 5 i computing the interest values of the topics in which the users downloads paper, or the score user rates paper different topics, the possible relevance of a keyword to different culated: DP(paperp, user )indicates the number of days that have topics is taken into account. In order to differentiate between the passed since the behavior occurred; p is a parameter for adjust pics for which a user has the same access behaviors, we assign a ment neasurement to each topic called the relevance factor. In the fol- d should be set as a value which optimizes the recommenda lowing, we first group of algorithms and definitions, and then tion performance. In the conventional ontology-based user interest ve present the definition and explanation of the relevance factor profiling algorithm, is set as 1(Middleton et al, 2004) and the explicit interest profiler. Definition 4(Inneredge ). An edge is termed an inner edge if it links Algorithm 1( Generating user-featured topic graph). Based on the two keyword nodes in the same weighted keyword definition of weighted keyword graph WKG, a user-featured topi graph, which is referred as UIG, can be created in the following edge strength. In this paper, we use IES(keyword, de iscalled to the inner edge strength of keyword, (1)Create the weighted keyword graph WKG(topic)for topic Definition 5 (Cross edge). An edge is termed as a cross edge if it (2) Eliminate all edges and reset weights of les to o to gen- links two nodes from different weighted keyword graphs. The sum erate a new graph UTG( topic). of the weights of the cross edges of a node is called cross edge (3)Based on the user's behavior history data, assign an edge strength. In is paper, we use CES(keyword) to refer to the cross tween keyword and keyword only if paper contains edge strength of keyword
X. Tang, Q. Zeng / The Journal of Systems and Software 85 (2012) 87–101 93 Fig. 6. The result of the ontology extension. refers to the user’s interest in a certain topic about which no papers have yet been directly accessed by the user. Through the ontological relationships between this topic and the other topics in which the user has explicit interests, the user’s implicit interest about this “uninteresting” topic can be predicted. Implicit interest pro- files express users’ potential interests in topics that are obscurely discovered. Definition 3 (User interest profile). A user interest profile consists of two parts: an explicit interest profile part and an implicit interest profile part, denoted by UEIP(Useru) and UIIP(Useru), respectively. Each of the two parts is a set of 2-tuples: UEIP(Useru) = {(topict, EI( topict, useru))|topict is a topic, EI( topict, useru) measures Useru ’ s explicit interest in topict}; and UIIP(Useru) = {(topic, II( topic, useru))|topic is a topic, II( topic, useru) measures Useru ’ s implicit interest in topic}. 5.1. The profiling method for users’ explicit interests We now introduce the construction of the explicit interest pro- file. The generation of explicit interest profiles is achieved through the user profiling module in Fig. 1. Fig. 7 shows a detailed process chart of this module. The key task of creating the explicit interest profiles of users lies in computing the interest values of the topics in which the users have explicit interest. Considering a paper’s possible relevance to different topics, the possible relevance of a keyword to different topics is taken into account. In order to differentiate between the topics for which a user has the same access behaviors, we assign a measurement to each topic called the relevance factor. In the following, we first give a group of algorithms and definitions, and then we present the definition and explanation of the relevance factor and the explicit interest profiler. Algorithm 1 (Generating user-featured topic graph). Based on the definition of weighted keyword graph WKG, a user-featured topic graph, which is referred as UTG, can be created in the following steps: (1) Create the weighted keyword graph WKG(topict) for topict. (2) Eliminate all edges and reset weights of all nodes to 0 to generate a new graph UTG(topict). (3) Based on the user’s behavior history data, assign an edge between keywordi and keywordj only if paperp contains both keywordi and keywordj, where paperp ∈ UPS(useru) and UPS(useru) represents the set of papers recorded in useru’s behavior history data; keywordi, keywordj ∈ UTG(topict). And the weights of the edges are defined as the number of related papers. (4) The weight of node keywordi in UTG which is referred as IK(keywordi) measuring the user’s interest in the content related to keywordi is calculated by IK(keywordi) = paperp ∈ PS(keywordi) (WKP(keywordi, paperp) · UIP(paperp, useru)), where keywordi ∈ KNS; PS(keywordi) is the set of papers containing keywordi and PS(keywordi) ∈ UPS(useru); UPS(useru) is the set of papers recorded in useru’s behavior history database; WKP(keywordi, paperp) is the weight of keywordi in the vector space model of paperp. And UIP(paperp, useru) = BF(paperp, useru) (DP(paperp, useru))1/˚ , ˚ ∈ (0, +∞), where BF(paperp, useru) denotes behavior factor which equals the number of times useru browses paperp, or the number of times useru downloads paperp, or the score useru rates paperp divided by 5, depending on the type of behavior based on which TG(topict) is calculated; DP(paperp, useru) indicates the number of days that have passed since the behavior occurred; ˚ is a parameter for adjustment. ˚ should be set as a value which optimizes the recommendation performance. In the conventional ontology-based user interest profiling algorithm, ˚ is set as 1 (Middleton et al., 2004). Definition 4 (Inner edge). An edge is termed an inner edge if it links two keyword nodes in the same weighted keyword graph. The sum of the weights of the inner edges of a keyword node is called inner edge strength. In this paper, we use IES(keywordi) to refer to the inner edge strength of keywordi. Definition 5 (Cross edge). An edge is termed as a cross edge if it links two nodes from different weighted keyword graphs. The sum of the weights of the cross edges of a node is called cross edge strength. In this paper, we use CES(keywordi) to refer to the cross edge strength of keywordi.
X Tang, Q Zeng/The Joumal of Systems and Software 85(2012)87-101 Weighted keyword graph based interest Projection Extended Subject Ontology Explicit interest User Behavior Database esO Paper Database Fig. 7. Process of generating explicit user interest profiles. Definition 6( Relevance strength). Based on the definitions ofinner gain(IG), mutual information(Mi), chi-square and expected cross edge and cross edge, the relevance strength RS(keyword, topic )of entropy(ECE)(Sebastiani and ricerche, 2002; Yang, 1995 ). Most keyword towards topic can be calculated by of these functions try to capture the intuition according to which IES(keyword the most valuable terms for a certain categorization are those that RS(keyword, topic)= are distributed most differently in the sets of positive and neg- ative examples of the category(Sebastiani and ricerche, 2002). which quantifies the relevance of keyword to topic. From Instead of employing these existing techniques, we propose the the definition of relevance strength we can determine that new measurement since it is expressly designed for our automated RS(keyword, topic e(0, 1 and IES(keyword)>0. classification method and it calculates the tightness between dif- Definition 7(Relevance factor). The relevance factor RFuser. ferent keywords well. The posterior experiment testifies its good pict indicating the measurement of the relevance of users inter- erformance in efficiency and accura est towards topic is given as 5.2. Profiling method for users'implicit interests RF( (user,topi)=(.UG元Gv.i),λ∈[1,+∞), Many researchers have hasized the importance of the in which TGV denotes a vector comprising the weights of all nodes novelty of the recommended items in recommender systems in UTG(topic)in the sequence as same as TGV: A is an adjustable for analyzing recommender systems that considered the"nonob- A should be assigned a value to maximize the recommendation viousness"of the recommendation: novelty and serendipity. Zhou performance. We will discuss this in Section 6. et al. (2010)sought to gain in both diversity and accuracy of mendations, and argued that"real value is found in the ability to The relevance factor quantifies the pertinence of the contents suggest objects users would not readily discover for themselves. of a certain topic that a user accesses to the most essential content that is, in the novelty and diversity of recommendation". They of the topic, with the purpose of measuring the relevance between employed two metrics for measuring recommendation diversit the users interest and the topic. Compared with other profiling strategies, ontology-based pro- Now, we present the algorithm of the explicit interest profiler. filing methods have the advantage that they can infer userinterests Algorithm 2 (Explicit interest The explicit interest of user on the basis of hierarchical structures and are able to utilize the relations between their ontologies and external concepts for ommendations. Due to ontological inference, user profiles can be rounded off and can be matched better to the wide range of user RF(ueop)∑ tonic UIP paper;uge terests(Felden and Linden, 2007). In Middleton et al. (2001, 2003, 2004), researchers used is-a relationships between subclasses and Ar stoni UIP(paperpi, useTo) their immediate super classes by adding 50% of the interest val ues of the classes to those of the super-classes. Trials substantiated where UTS(user)signifies the set of topics that user has explicit that ontological inference boosted the recom Interests in mendation accuracy of individual recommendations To some extent. the measurement of the relevance factor is Instead of taking advantage of the is-a relationship between similar to the feature selection techniques in text classification, subordinate topics and their super-topics, we believe that the which forms the basis of the task. Well-developed feature selec- semantic relations between topics under the same super-topics tion techniques include document frequency(DF). information are more preferable for inferring user interests. In our extended
94 X. Tang, Q. Zeng / The Journal of Systems and Software 85 (2012) 87–101 Fig. 7. Process of generating explicit user interest profiles. Definition 6 (Relevance strength). Based on the definitions of inner edge and cross edge, the relevance strength RS(keywordi, topict) of keywordi towards topict can be calculated by RS(keywordi, topict) = IES(keywordi) IES(keywordi) + CES(keywordi) , which quantifies the relevance of keywordi to topict. From the definition of relevance strength we can determine that RS(keywordi, topict) ∈ (0, 1] and IES(keywordi) > 0. Definition 7 (Relevance factor). The relevance factor RF(useru, topict) indicating the measurement of the relevance of useru’s interest towards topict is given as RF(useru, topict) = ((−−→TGV · −−−→ UTGV)/( −−→TGV · 1)) 1/, ∈ [1, +∞), in which −−→TGV denotes a vector comprising the weights of all nodes in TG(topict); −−−→ UTGV denotes a vector listing the weights of all nodes in UTG(topict) in the sequence as same as −−→TGV; is an adjustable parameter. The setting of depends on experimental data, and should be assigned a value to maximize the recommendation performance. We will discuss this in Section 6. The relevance factor quantifies the pertinence of the contents of a certain topic that a user accesses to the most essential content of the topic, with the purpose of measuring the relevance between the user’s interest and the topic. Now, we present the algorithm of the explicit interest profiler. Algorithm 2 (Explicit interest profiler). The explicit interest of useru in topict EI( topict, useru) is computed by EI(topict, useru) = RF(useru, topict) · paperp ∈ topict UIP(paperp, useru) topicti ∈ UTS(useru) RF(useru, topicti) paperpi ∈ topicti UIP(paperpi, useru) , where UTS(useru) signifies the set of topics that useru has explicit interests in. To some extent, the measurement of the relevance factor is similar to the feature selection techniques in text classification, which forms the basis of the task. Well-developed feature selection techniques include document frequency (DF), information gain (IG), mutual information (MI), chi-square and expected cross entropy (ECE) (Sebastiani and Ricerche, 2002; Yang, 1995). Most of these functions try to capture the intuition according to which the most valuable terms for a certain categorization are those that are distributed most differently in the sets of positive and negative examples of the category (Sebastiani and Ricerche, 2002). Instead of employing these existing techniques, we propose the new measurement since it is expressly designed for our automated classification method and it calculates the tightness between different keywords well. The posterior experiment testifies its good performance in efficiency and accuracy. 5.2. Profiling method for users’ implicit interests Many researchers have emphasized the importance of the novelty of the recommended items in recommender systems. Herlocker et al. (2004) asserted that new dimensions were needed for analyzing recommender systems that considered the “nonobviousness” of the recommendation: novelty and serendipity. Zhou et al. (2010) sought to gain in both diversity and accuracy of recommendations, and argued that “real value is found in the ability to suggest objects users would not readily discover for themselves, that is, in the novelty and diversity of recommendation”. They employed two metrics for measuring recommendation diversity. Compared with other profiling strategies, ontology-based pro- filing methods have the advantage that they can infer user interests on the basis of hierarchical structures and are able to utilize the relations between their ontologies and external concepts for recommendations. Due to ontological inference, user profiles can be rounded off and can be matched better to the wide range of user interests (Felden and Linden, 2007). In Middleton et al. (2001, 2003, 2004), researchers used is–a relationships between subclasses and their immediate super classes by adding 50% of the interest values of the classes to those of the super-classes. Trials substantiated that ontological inference boosted the recommendation accuracy of individual recommendations. Instead of taking advantage of the is–a relationship between subordinate topics and their super-topics, we believe that the semantic relations between topics under the same super-topics are more preferable for inferring user interests. In our extended
X Tang QZeng/The journal of systems and Software 85(2012 )87-101 Algorithm 3(Implicit interest profiler). Let vC(topici) denotes the topic connectivity of the node topici, which equals the sum of the weights of all edges linked to the node topici: let Il(topic, user)be users implicit interest value for topix. Then, Il(topicx, user u) can be com- WET(ELx) topic j ll(topiX, user) vC(topic). vC(topig.El(to the set of topic nodes neighboring topic in m (topic)stands for in which topic∈ ITNS and topic∈EINS: Neighbor. topic topic 6. Prototype system and empirical evaluation 6. 1. SPRS-a prototype for scientific paper recommender system Fig 8. An example of topic network graph. In order to assess the precision and effectiveness of our subject ontology the semantic relations between topics can approach, we developed a scientific paper recommende system(SPRS), which is illustrated in Fig 9. There are three columns btained by an inverse process of keyword clustering. Here, we in the homepage of SPRS. The left column consists of three parts give the definition of topic network graphs to reveal these semantic relations. user information, the papers which are currently read by the cur rent user and the papers that the current user wants to read. The Definition8(Topic network graph). A topic network graph is a middle column presents the paper recommendations separately graph TNG=(ETNS. ITNS, ES), where based on the records of three behaviors download browse and comment. In the right column, three lists of papers in the behavioral (1) ETNS is the node set in which each node represents a topic in records are presented. which one or more papers are ever accessed by user, the weight Based on the user profiles obtained, SPRS is able to provide of the node topic equals El( topic, user) paper recommendations to users. Recommendations are produced by correlating the users' topics of interest and the papers class ( 2)ITNS is the node set in which each node represents a topic in fied to these topics. The explicit interests and implicit interests of hich no papers are ever accessed by user, the weight of the node topix equals the measurement of user 's implicit interest a certain user are both, separately, used for recommendation. The in topix, which is denoted as ll( topic, user). The calculation top-N items rule is implemented to control the number of recom- of ll( topic, user) is given in Algorithm 3 mendations. Papers are ranked in the order of the recommendation ()The set of edges ES C((ETNSUIINS)x(ETNSU ITNS), represent- confidence of them when being presented to a specific user;a the relevance between different topics in TNS. papers recommendation confidence for a user equals the produc of the classification confidence of this paper in a topic that maxi (4)WET(Ey)=sum(MES(topic topic), where Ey represents the mizes this paper's classification confidence and this user's degree ge connecting topic and topic: WET(Ei) denotes the weight of interest in this topic. Depending on which interest profile is used responds to an edge in WKG connecting one keyword belongs to interest or implicit interest st degree can be of either explicit topici and one keyword belongs to topic: the function sum()cal culates the summation of the weights of the edges in MES(topic The behavior history of users comment histories, are recorded and used to calculate their interest profiles. Also, papers that have been accessed by users before will be ignored when recommendations are provided The recommen- Thereby, we can now obtain a topic network graph for each sec- dations are presented separately. If users are interested in these ondary subject. Fig8 shows a sample of a topic network graph. The recommendations, they certainly tend to click on them, unless demonstration is provided for illustrative purposes only. In Fig.8. these items have already been visited the yellow circles represent topics belong to ETNS and the white circles represent topics belong to ITNS. 6.2. Empirical evaluation The relevance between topics is a meaningful way to infer implicit interests. As long as a user has explicit interests in some In this section, we conducted three experiments by SPrs has not yet explored can be estimated reasonably by the following ontology-based profiling approach proposed in this paper tended pics, the implicit interests of the user in the topics that the user to examine the ntology extens nethod and the ex algorithm. 6.2.1. Experiment 1: the acquisition of the extended subject ontology Topics of the secondary subject"databases First, we evaluated the extension of the original subject ontology by generating keyword clusters for the weighted keyword graphs onnectivity of central ofartificial intelligence "and"databases"subject. We used 200 papers in each subject in the following experiment. The papers data mining were from the Science Paper Online website(Sciencepap 2010). The clustering result of the"artificial intelligence"subject 10 has already been used as the examples shown in Section 4. In this section, we only discuss the clustering result of"databases". The weighted keyword graph of"databases"in which the edge weight spatial data mining threshold value is 1 is showed in Fig. 10. Then we assign 2 as the
X. Tang, Q. Zeng / The Journal of Systems and Software 85 (2012) 87–101 95 Fig. 8. An example of topic network graph. subject ontology, the semantic relations between topics can be obtained by an inverse process of keyword clustering. Here, we give the definition of topic network graphs to reveal these semantic relations: Definition 8 (Topic network graph). A topic network graph is a graph TNG= (ETNS, ITNS, ES), where (1) ETNS is the node set in which each node represents a topic in which one ormore papers are ever accessed by useru, the weight of the node topici equals EI( topict, useru). (2) ITNS is the node set in which each node represents a topic in which no papers are ever accessed by useru, the weight of the node topicx equals the measurement of useru’s implicit interest in topicx, which is denoted as II( topict, useru). The calculation of II( topict, useru) is given in Algorithm 3. (3) The set of edges ES ⊆ ((ETNS∪ITNS) × (ETNS∪ITNS)), representing the relevance between different topics in TNS. (4) WET(Ei,j) = sum(MES(topici, topicj)), where Ei,j represents the edge connecting topici and topicj; WET(Ei,j) denotes the weight of Ei,j; MES(topici, topicj) is the edge set in which each edge corresponds to an edge in WKG connecting one keyword belongs to topici and one keyword belongs to topicj; the function sum( ) calculates the summation of the weights of the edges in MES(topici, topicj). Thereby, we can now obtain a topic network graph for each secondary subject. Fig. 8 shows a sample of a topic network graph. The demonstration is provided for illustrative purposes only. In Fig. 8, the yellow circles represent topics belong to ETNS and the white circles represent topics belong to ITNS. The relevance between topics is a meaningful way to infer implicit interests. As long as a user has explicit interests in some topics, the implicit interests of the user in the topics that the user has not yet explored can be estimated reasonably by the following algorithm. Table 2 Topics of the secondary subject “databases”. Topic name Connectivity of central keyword node data mining 32 SQL 14 clustering 11 data warehouse 10 distributed database 9 relational database 9 spatial data mining 7 Algorithm 3 (Implicit interest profiler). Let VC(topici) denotes the connectivity of the node topici, which equals the sum of the weights of all edges linked to the node topici; let II(topicx, useru) be useru’s implicit interest value for topicx. Then, II(topicx, useru) can be computed as follows: II(topicx, useru) = topici ∈ Neightbors(topicx ) WET(Ei,x) 2 VC(topicx) · VC(topici) · EI(topici, useru) , in which topicx ∈ ITNS and topici ∈ ETNS; Neighbors(topicx) stands for the set of topic nodes neighboring topicx in TNG. 6. Prototype system and empirical evaluation 6.1. SPRS—a prototype for scientific paper recommender system In order to assess the precision and effectiveness of our approach, we developed a scientific paper recommender prototype system (SPRS), which is illustrated in Fig. 9. There are three columns in the homepage of SPRS. The left column consists of three parts: user information, the papers which are currently read by the current user and the papers that the current user wants to read. The middle column presents the paper recommendations separately based on the records of three behaviors: download, browse and comment. In the right column, three lists of papers in the behavioral records are presented. Based on the user profiles obtained, SPRS is able to provide paper recommendations to users. Recommendations are produced by correlating the users’ topics of interest and the papers classi- fied to these topics. The explicit interests and implicit interests of a certain user are both, separately, used for recommendation. The top-N items rule is implemented to control the number of recommendations. Papers are ranked in the order of the recommendation confidence of them when being presented to a specific user; a paper’s recommendation confidence for a user equals the product of the classification confidence of this paper in a topic that maximizes this paper’s classification confidence and this user’s degree of interest in this topic. Depending on which interest profile is used for recommendation, the interest degree can be of either explicit interest or implicit interest. The behavior history of users, including browse, download and comment histories, are recorded and used to calculate their interest profiles. Also, papers that have been accessed by users before will be ignored when recommendations are provided. The recommendations are presented separately. If users are interested in these recommendations, they certainly tend to click on them, unless these items have already been visited. 6.2. Empirical evaluation In this section, we conducted three experiments by SPRS to examine the ontology extension method and the extended ontology-based profiling approach proposed in this paper. 6.2.1. Experiment 1: the acquisition of the extended subject ontology First, we evaluated the extension of the original subject ontology by generating keyword clusters for the weighted keyword graphs of “artificial intelligence” and “databases” subject. We used 200 papers in each subject in the following experiment. The papers were from the Science Paper Online website (Sciencepaper Online, 2010). The clustering result of the “artificial intelligence” subject has already been used as the examples shown in Section 4. In this section, we only discuss the clustering result of “databases”. The weighted keyword graph of “databases” in which the edge weight threshold value is 1 is showed in Fig. 10. Then we assign 2 as the
X Tang, Q Zeng/The Joumal of Systems and Software 85(2012)87-101 !r! Paper Recommendation according to Download History Cption. Download history option findno thermite Option. ata warehouse of cc ree aftermarket edentata warehause tssbnocootthe paper Recommendation according to Comment History Option -Comment history ALRight Reserved Fig 9. Scientific paper recommender system(SPRS). Bl. F: \LAB FORKS\Pajek Network Bodel\dat abase ll en, net(377) Layout Graphonly Frevious Redr a Next anagement comedon cocffoenl uk-stream data data clean gating ght charges namc tue comping light pal ector paty e wnl diastems data nal privacy constant on predicates Fig. 10. The weighted keyword graph of"databases"whose edge weight threshold value is 1
96 X. Tang, Q. Zeng / The Journal of Systems and Software 85 (2012) 87–101 Fig. 9. Scientific paper recommender system (SPRS). Fig. 10. The weighted keyword graph of “databases” whose edge weight threshold value is 1