A Paper Recommendation Mechanism for the Research Support System Papits

Satoshi Watanabe, Takayuki Ito, Tadachika Ozono and Toramatsu Shintani
Graduate School of Engineering, Nagoya Institute of Technology
Gokiso-cho, Showa-ku, Nagoya, Aichi, 466-8555 Japan
{watanabe, itota, ozono, tora}@ics.nitech.ac.jp

Abstract

We have developed Papits, a research support system that shares research information, such as PDF files of research papers, on networked computers and classifies the information into research types. Papits users can share various research information and survey the corpora of their particular fields. To develop Papits, we need to design a mechanism that identifies a user's interests. Also, when constructing an effective paper recommendation system, it is important to carefully create user models. We propose a method to construct user models using a scale-free network. A scale-free network has vertices and edges, and grows by 'preferential attachment'. Our method applies a paper viewing history to construct a scale-free network based on word co-occurrence. A constructed network consists of vertices that represent words and edges that represent word co-occurrences. In our method, a paper is added to the network as indicated by a user's paper viewing history. Additionally, we define the 'topic weight', which we calculate from two elements: the topic frequency and the topic recency. We measure the topic frequency using word co-occurrence in a database, and the topic recency using the Jaccard coefficient. Our results indicate that our method can effectively recommend documents for Papits users.

1. Introduction

As information technology becomes an indispensable part of our daily life, a huge amount of information is shared throughout the world. The speed and amount of this sharing have accelerated with the advent of the Internet, and users are becoming overloaded with information. With such a mixture of valid and noisy information, we need tools to identify useful information or knowledge that meets the demands of individual users. We have therefore developed a research support system called Papits (PAPer Information Tailor System)[8][18][13]. Papits has several functions that allow it to manage research information: paper sharing, paper recommending, paper retrieving, paper classifying and a research diary. The paper sharing function facilitates the sharing of research information, such as the PDF files of research papers, and the collection of papers from Web sites.

The recommendation function constructs a user model to determine a user's research interests and specialties. This model is constructed by analyzing the research papers that a user has viewed, and it enables Papits to recommend papers based on the user's interests. The recommendation in Papits gradually improves in accuracy through users' paper viewing histories. This particular paper focuses on the paper recommendation. One of the main problems associated with recommendation is how to reduce information overload and realize a precise and accurate recommendation. In conventional research on natural language processing, the TF-IDF method[21] has been used to weight words for searching or summarization. The TF-IDF method is also often used to calculate the importance of, and similarities between, documents. The calculation, however, does not take the differences between users' interests into account. As each user's interests differ, the weight of a word should differ for every user.
We proposed and applied a recommendation mechanism in Papits that uses the user's paper viewing history to reflect the user's interests or specialties. By using a recommendation mechanism, we can discover many papers in various databases, and each paper can be classified according to the following characteristics.

(1) Does the paper state an important fact or not?
(2) Does the paper state a novel fact or a known fact?
(3) Does the paper state a fact that is interesting to the user at the present moment?

When constructing a user model, ideally we would like to discover papers that are important, novel, and of interest to the user. Conventional recommendation mechanisms mainly deal with characteristic (1), the importance of papers, for example by using a statistical approach. Some mechanisms rank papers by using the precision and recall
value of each rule. However, it is not easy to deal with the other characteristics, as the novelty and significance of a paper to the user may change over time.

To deal with characteristics (2) and (3), we utilized the user's paper viewing history. This allowed us to check whether or not a paper is novel. Moreover, this monitoring enabled us to also determine a user's preferences and interests and to check whether or not a paper is of interest at the present moment. Additionally, we define the 'topic model', which has two elements: the topic frequency and the topic recency. We measure the topic frequency using word co-occurrence in a database, and the topic recency using the Jaccard coefficient.

In the first section, we describe the recommendation algorithm for managing research papers. Second, we outline the Papits research support system. Third, we discuss the results obtained with our algorithm and demonstrate its usefulness. Fourth, we compare our work with similar research. Finally, we conclude with a brief summary.

2. Paper recommendation mechanism

When constructing the user model, we used a scale-free network to measure the frequency and recency of words. The scale-free network has the characteristics of 'growth' and 'preferential attachment'. We also use the 'fitness' of vertices in the network; the fitness affects the probability that a newly added vertex attaches to an existing vertex. When constructing a network, we regard the papers a user possesses as evidence of the user's interests and specialities. We represent the words contained in those papers as the vertices of the network, and the word co-occurrences as its edges.

2.1. Scale-free network

The scale-free network results from two generic mechanisms: (i) networks continuously expand with the addition of new vertices, and (ii) new vertices attach preferentially to vertices that are already well connected. The scale-free network concept is as follows:

(i) Initially, the network has no edges and m_0 vertices (Figure 1 A). In Figure 1, ◦ denotes an added vertex and • denotes an existing vertex. The network grows sequentially from A to B, and from B to C in Figure 1, with one new vertex added per time step τ. The added vertex attaches to m of the existing vertices. This process is called 'growth' in the scale-free network.

(ii) As the network adds a new vertex (i = τ + m_0), the new vertex preferentially selects well-connected vertices from the existing ones (i = 0, 1, ..., τ − 1 + m_0). The probability that the new vertex attaches to an existing vertex i is proportional to the number of edges k_i that vertex i has[3][1]:

    Π_i = Π(k_i) ≡ k_i / Σ_{j=0}^{τ−1+m_0} k_j    (0 ≤ i < τ + m_0)    (1)

[Figure 1. A conception of scale-free network]

The probability P(k) that a vertex in the network interacts with k other vertices decays as a power law, following P(k) ∼ k^{−γ}.
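To make the 'growth' and 'preferential attachment' mechanisms concrete, the following minimal sketch (hypothetical Python, not part of Papits itself) grows a network from m0 isolated vertices, attaching each new vertex to m existing vertices chosen with probability proportional to their degree, as in Equation 1.

import random

def grow_scale_free(m0=3, m=2, steps=100):
    # Start with m0 isolated vertices (Figure 1 A): no edges yet.
    degree = {i: 0 for i in range(m0)}
    edges = []
    for step in range(steps):
        new = m0 + step
        total = sum(degree.values())
        targets = set()
        while len(targets) < min(m, len(degree)):
            if total == 0:
                # No edges exist yet, so choose uniformly at random.
                targets.add(random.choice(list(degree)))
            else:
                # Preferential attachment (Equation 1): pick vertex i
                # with probability k_i / sum_j k_j.
                r = random.uniform(0, total)
                acc = 0.0
                for v, k in degree.items():
                    acc += k
                    if acc >= r:
                        targets.add(v)
                        break
        degree[new] = 0
        for v in targets:
            edges.append((new, v))
            degree[new] += 1
            degree[v] += 1
    return degree, edges

A power-law degree distribution emerges because high-degree vertices accumulate new edges faster; this is the property the user model exploits, since frequently co-occurring words become hubs.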
The collaboration graph of movie actors represents a well-documented example of a social network. Each actor is represented by a vertex, two actors being connected if they were cast together in the same movie. The probability that an actor has k links (characterizing his or her popularity) has a power-law tail for large k, following P(k) ∼ k^{−γ_actor}, where γ_actor = 2.3 ± 0.1. A more complex network, with over 800 million vertices, is the WWW, where a vertex is a document and the edges are the links pointing from one document to another. The topology of this graph determines the Web's connectivity and, consequently, our effectiveness in locating information on the WWW. Information about P(k) can be obtained using robots, indicating that the probability that k documents point to a certain Web page follows a power law, with γ_www = 2.1 ± 0.1[3].

Real networks also have a competitive aspect, as each vertex has an intrinsic ability to compete for edges at the expense of other vertices[10]. The authors of [10] propose a model in which each vertex is assigned a fitness parameter η_i which does not change in time. Thus, at every time step a new vertex j with a fitness η_j is added to the system, where η_j is chosen from a
distribution ρ(η). Each new vertex connects with m edges to the vertices already in the network, and the probability of connecting to a vertex i is proportional to the degree and the fitness of vertex i:

    Π_i = η_i k_i / Σ_j η_j k_j    (2)

It is well known that the more frequent a word, the more available it is for production and comprehension processes. This phenomenon is known as the frequency (referring to the whole of an individual's experience) or recency (referring to an individual's recent experience) effect. This phenomenon shows that preferential attachment is very likely to shape the scale-free distribution of degrees[7].

To deal with characteristics (2) and (3), we regarded the words in the papers a user possesses as the vertices of the scale-free network, and word co-occurrences as the edges. We also calculated the frequency and the fitness of words from Equation 2. Checking the user's interests or specialities over time, we considered that the user's interests or specialities are determined by the words that frequently appear in the paper viewing history and by the words that appear most recently. Namely, we represent the frequency in the network as the user's longer-term interests and the fitness in the network as the user's shorter-term interests.

2.2. Construction of user's model

This section outlines the construction of the user model based on the user's paper viewing history. Our method uses the papers that a user possesses to construct a network based on word frequency and word co-occurrences. The process of our method is as follows:

Step 1 Papers use natural language and require modification before processing. The most frequent terms, such as 'a' and 'it', are considered to be common and meaningless[14]. For this reason, we first remove the stopwords used in the SMART system[21].

Step 2 Based on the assumption that terms with a common stem usually have similar meanings, the various -ED, -ING, -ION, -IONS suffixes are removed to produce the stem word. For example, PLAY, PLAYS, PLAYED and PLAYING are all translated into PLAY. Our method employs Porter's suffix stripping algorithm[19].

Step 3 Our method continuously adds words and word co-occurrences to the network. As previously mentioned, words are the network vertices and word co-occurrences are the network edges. If a word or a word co-occurrence has already been added to the network, it is not repeated.

[Figure 2. A creation procedure of user's model]

The construction of the user's model[22] can be seen in Figure 2. The user model generation mechanism takes the papers a user possesses, eliminates stopwords, preprocesses them by stemming, and constructs or adjusts the user's model. The comparison and selection mechanism, also shown in Figure 2, compares the constructed user model to the papers in the Papits database; by comparing the papers in the database to the user's model, Papits can recommend papers which are of interest to the user. Figure 3 represents the user's model made from a paper[7], drawn as the preprocessed network. Each vertex in Figure 3 is drawn as a square containing a word, and each edge as a line between two vertices. The squares at the core are the frequent words, which represent the user's core interests.

[Figure 3. An example of constructed network]
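As a concrete illustration of Steps 1-3, the sketch below (hypothetical Python; the stopword list and the suffix stripper are crude stand-ins for the SMART stopword list and Porter's algorithm actually used) builds the word co-occurrence network from the text of viewed papers, with words as vertices and within-sentence co-occurrences as edges.

import itertools
import re

# Stand-in for the SMART stopword list (Step 1).
STOPWORDS = {"a", "an", "the", "it", "of", "and", "to", "in", "is", "are", "by"}

def stem(word):
    # Crude stand-in for Porter's suffix stripping algorithm (Step 2).
    for suffix in ("ions", "ion", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def add_paper(network, text):
    # Step 3: add words (vertices) and within-sentence co-occurrences
    # (edges) to the network; duplicates are not repeated.
    for sentence in re.split(r"[.!?]", text.lower()):
        words = {stem(w) for w in re.findall(r"[a-z]+", sentence)
                 if w not in STOPWORDS}
        for w1, w2 in itertools.combinations(sorted(words), 2):
            network.setdefault(w1, set()).add(w2)
            network.setdefault(w2, set()).add(w1)

user_model = {}
add_paper(user_model, "Scale-free networks grow by preferential attachment. "
                      "Words in papers form a network.")
frequency = {w: len(nbrs) for w, nbrs in user_model.items()}  # degree = 'frequency'

In this reading, a word's 'frequency' is its degree, i.e., the number of distinct words it co-occurs with, matching the network interpretation used above.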
For example, we measured the frequency and fitness of words using a paper[7] (hereinafter called "paper A"), a paper[6] (hereinafter called "paper B"), and a paper[4] (hereinafter called "paper C"). Papers A and B both describe language and networks, and paper C describes networks.

Table 1 shows the top ten most frequent words, together with each word's frequency and two fitness values. The fitness I values in Table 1 are calculated by processing the papers in the order paper A, paper B, paper C; the fitness II values are calculated in the order paper A, paper C, paper B.

word        frequency   fitness I   fitness II
network     87          0.274       0.141
word        81          0.000       0.276
langu       67          0.000       0.167
connect     57          0.198       0.206
numb        48          0.000       0.133
observ      44          0.187       0.109
feat        43          0.172       0.000
distribut   43          0.283       0.149
system      39          0.323       0.000
found       39          0.189       0.164

Table 1. Frequencies of words and fitness of words

As the fitness values in Table 1 show, even if different users read the same papers, reading them in a different order or at a different time yields different values. Thus, the latest paper which a user reads alters the fitness.
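Given the degrees and fitness values of the words, the attachment probabilities of Equation 2 can be computed as in the sketch below (hypothetical Python; the numbers are toy values loosely echoing Table 1, purely for illustration).

def attachment_probabilities(degree, fitness):
    # Equation 2: Pi_i = eta_i * k_i / sum_j (eta_j * k_j).
    z = sum(fitness[w] * degree[w] for w in degree)
    if z == 0:
        return {w: 0.0 for w in degree}
    return {w: fitness[w] * degree[w] / z for w in degree}

degree  = {"network": 87, "word": 81, "connect": 57}           # k (frequency)
fitness = {"network": 0.274, "word": 0.000, "connect": 0.198}  # eta (recency)
probs = attachment_probabilities(degree, fitness)
# 'word' has high frequency but zero fitness, so its probability is 0:
# words unseen in recent papers stop attracting new edges.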
An interface for paper recommendation using Papits can be seen in Figure 4. Inside the bold line in Figure 4 is the paper recommendation, with title, authors, and paper relevance; the relevance is the similarity value. As shown in Figure 4, Papits can recommend several papers in descending order of similarity, based on our recommendation mechanism.

[Figure 4. An interface for paper recommendation]

2.3. Construction of topic model

This section presents how to construct the topic model. Our method uses the large number of papers in the Papits database to construct the topic model, which is based on the frequencies and recencies of word co-occurrences. The Papits database contains bibliographical information on information technology articles, including the year of publication, and Papits can retrieve the information according to the year of publication. By observing the frequencies and recencies of word co-occurrences, we can see changes in the relations among words and address characteristic (2) mentioned in Section 1, i.e., whether or not a paper is novel.

We measure the frequencies and recencies of word co-occurrences and divide topics into four situations, as follows:

• If the frequency of word co-occurrence is high, the research topic related to the keywords is commonly known.

• If the frequency of word co-occurrence is low, the research topic related to the keywords is not well known; few researchers show interest in the topic.

• If the recency of word co-occurrence moves upward, the research topic related to the keywords is hot and promising.
• If the recency of word co-occurrence moves downward, the research topic related to the keywords is closing.

In order to select from the papers in the Papits database, we calculate the topic weight T_{wn,wm} as follows:

    T_{wn,wm} = Tfreq(wn, wm) · Trecency(wn, wm)    (3)

where wn and wm are words which co-occur in the same sentence, Tfreq(wn, wm) is the number of their co-occurrences over all papers in the Papits database, and Trecency(wn, wm) is the novelty of the topic. We use Equation 4 to calculate Trecency(wn, wm), and the Jaccard coefficient (Equation 5) to calculate R_{wn,wm}(t), the Jaccard coefficient between wn and wm at time t. By comparing R_{wn,wm}(t) to R_{wn,wm}(t − 1), we determine the recency of the topic, i.e., whether it moves upward or downward:

    Trecency(wn, wm) = R_{wn,wm}(t) / R_{wn,wm}(t − 1)    if R_{wn,wm}(t − 1) ≠ 0
    Trecency(wn, wm) = R_{wn,wm}(t)                       if R_{wn,wm}(t − 1) = 0
    R_{wn,wm}(0) = 1    (4)

    R_{wn,wm}(t) = |wn ∩ wm| / |wn ∪ wm|    (5)
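A minimal sketch of Equations 3-5, assuming each word is represented by the set of identifiers of the papers it appears in within one period, so that Equation 5 is a plain set-overlap Jaccard coefficient; the document-id sets below are hypothetical.

def jaccard(docs_wn, docs_wm):
    # Equation 5: R(t) = |wn ∩ wm| / |wn ∪ wm| for one period.
    union = docs_wn | docs_wm
    return len(docs_wn & docs_wm) / len(union) if union else 0.0

def topic_weight(t_freq, r_t, r_t_minus_1):
    # Equations 3 and 4: T = Tfreq * Trecency, where
    # Trecency = R(t) / R(t-1) if R(t-1) != 0, else R(t).
    t_recency = r_t / r_t_minus_1 if r_t_minus_1 != 0 else r_t
    return t_freq * t_recency

# Hypothetical paper-id sets containing wn and wm in two periods:
now_wn, now_wm   = {1, 2, 3, 5}, {2, 3, 4, 5}
prev_wn, prev_wm = {1, 2}, {2, 6}
T = topic_weight(t_freq=12,
                 r_t=jaccard(now_wn, now_wm),
                 r_t_minus_1=jaccard(prev_wn, prev_wm))
# The Jaccard coefficient rose from 1/3 to 3/5, so the topic is trending
# upward and T exceeds the raw co-occurrence count Tfreq.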
3. Research Support System Papits

This section outlines Papits, which has several functions that allow it to manage research information: paper sharing, paper recommending, paper retrieving, paper classifying and a research diary. Papits is a multiagent-based research support system, implemented as a web application using WebObjects (a tool for creating Web applications, developed by Apple) and MiLog[9], a Java-based mobile agent framework. MiLog provides useful functions for effective Web access and allows users access via a Web browser. The paper sharing function facilitates the sharing of research information, such as the PDF files of research papers, and the collection of papers from Web sites. This paper mainly discusses the paper recommending function, which can provide intense support for surveying fields of research interest. The Papits architecture is shown in Figure 5.

[Figure 5. An architecture of Papits]

3.1. Paper recommendation agent

The paper recommendation agent recommends papers which a user may want to read. First, the agent constructs a user model from the user's paper viewing history, as described in Section 2. Second, the agent compares the user's model to a paper's profile in the Papits database. A paper's profile is constructed from the PDF files included in the database and is representative of the paper's contents. The profiles are based on a thesaurus constructed from the PDF files. Our paper recommendation mechanism focuses on Equation 6 and on word co-occurrence in papers.

In Equation 6, we define the fitness η in the network as the 'recency' of a word, and the number of edges each vertex has as its 'frequency'. By using word co-occurrence within a sentence, we can identify the meaning of polysemous words and characterize words that appear at low frequencies in a paper. We obtained the relationships of word co-occurrence from the user's paper viewing history and compared the user's model to papers; consequently, we can list papers according to the user's priority.

Equation 6 calculates the similarity between a user's model N_X and a paper P_Y:

    sim(N_X, P_Y) = (1/n²) Σ_{i=1}^{n} Σ_{j=1}^{n} Π_i Π_j    (6)

where n is the number of terms that appear in both N_X and P_Y, and Π_i is the value of the word w_i given by Equation 2. Equation 6 thus uses both word frequency and word fitness.

However, Equation 6 cannot consider whether or not a paper is novel and whether or not a paper is of interest. Therefore, to address characteristics (2) and (3) mentioned in Section 1, we expand Equation 6 as follows:

    sim(N_X, P_Y) = (1/n²) Σ_{i=1}^{n} Σ_{j=1}^{n} T_{wi,wj} k_{wi,wj} Π_i Π_j    (7)

where T_{wi,wj} is the topic weight mentioned in Section 2.3 and k_{wi,wj} is the number of co-occurrences of the words w_i and w_j.
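Read literally, Equations 6 and 7 average pairwise products of the word values Π over the n terms common to the user's model and the paper; the sketch below implements that reading (hypothetical Python; the Π, T, and k inputs are illustrative stand-ins for values produced by Equations 2 and 3).

def sim_eq6(pi):
    # Equation 6: sim = (1/n^2) * sum_i sum_j Pi_i * Pi_j.
    n = len(pi)
    return sum(pi[i] * pi[j] for i in range(n) for j in range(n)) / n ** 2

def sim_eq7(pi, topic, cooc):
    # Equation 7: sim = (1/n^2) * sum_i sum_j T_ij * k_ij * Pi_i * Pi_j,
    # weighting each word pair by its topic weight T and co-occurrence count k.
    n = len(pi)
    return sum(topic[i][j] * cooc[i][j] * pi[i] * pi[j]
               for i in range(n) for j in range(n)) / n ** 2

pi = [0.4, 0.3, 0.3]              # Pi values of the shared words (Equation 2)
topic = [[1.0, 1.8, 0.5],
         [1.8, 1.0, 1.2],
         [0.5, 1.2, 1.0]]         # hypothetical topic weights T
cooc = [[0, 2, 1],
        [2, 0, 3],
        [1, 3, 0]]                # hypothetical co-occurrence counts k
print(sim_eq6(pi), sim_eq7(pi, topic, cooc))

Because T and k vanish for word pairs that no longer co-occur in recent papers, Equation 7 suppresses stale topics that Equation 6 would still score highly.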
3.2. Paper collection agent

There are many papers which can be accessed by the public via the Internet, and the paper collection agent is able to collect these papers as PDF files. A conventional paper collection agent can collect files from researchers' web sites, the Research Index (http://citeseer.ist.psu.edu/), the ACM Digital Library (http://portal.acm.org/dl.cfm), and Science Direct (http://www.sciencedirect.com/).

3.3. Paper categorization agent

The paper categorization agent categorizes the papers included in the database. Initially, papers in the database are not categorized, but eventually the paper categorization agent categorizes them into pre-defined categories based on existing classifiers[13]. Automatic classification helps users locate papers by following their category of interest. The main problem in automatic text classification is to identify which words are the most suitable for classifying documents into predefined classes. This section discusses the text classification method for Papits and our feature selection method. In Papits, automatic classification needs to classify documents into a multivalued category, because research is organized into various fields. However, feature selection becomes sensitive to noise and irrelevant data compared to cases with few categories, and there may not be enough registered papers as training data to identify the most suitable words for classification into the multivalued category. We propose a feature selection method to classify documents, each represented as a bag-of-words, into the multivalued category. Several existing feature selection techniques use some metric to determine the relevance of a term with regard to the classification criterion; information gain (IG) is often used in text classification in the bag-of-words approach.

4. Evaluation Experiments

We measured our method's effectiveness in terms of recall and misrecognition analysis. We evaluated the effectiveness by comparing our method to the Vector Space Model (VSM)[21], a co-occurrence-based thesaurus[11], and IRM[16]. IRM is a mechanism that supports users in Web browsing, similar to our method; however, IRM does not consider long- and short-range interests, so we evaluated whether our method can measure both. We used Equation 7 to measure whether our method was more reliable than the other existing methods. Additionally, by comparing Equation 7 to Equation 6, we measured the effectiveness of the topic model mentioned in Section 2.3 from the points of view of characteristics (2) and (3), at three points in time: six, four, and two months ago.

The experiment was as follows. We collected users' models and the papers read over an eight-month period. Following our method, we eliminated stopwords and performed stemming as preprocessing, then added words and word co-occurrences to the network in the order in which the papers were read, and used the resulting network as each user's model. We used the papers stored in the Papits database: over 10,000 papers describing information technology in English were included.

4.1. Vector Space Model

The vector space model[20] is widely used in information retrieval systems. In this model, documents and queries are represented as bags of terms, and statistics concerning these terms and the documents they appear in are gathered together into an index. In the index, each distinct term t has an associated document frequency, denoted f_t, which indicates the number of documents it appears in.
In addition, each term is associated with an inverted list of pointers ⟨d, f_{d,t}⟩ recording that term t appears in document d a total of f_{d,t} times. Moreover, each document d has a corresponding value W_d associated with it, its document length, which is calculated as a function of f_t and f_{d,t} for the terms in that document. Generally speaking, W_d is greater when a document is physically longer, but W_d usually also depends upon the relative scarcity of the terms in the document.

In ranking a query q against the database, the vector space model employs a similarity heuristic to calculate a score S_{q,d} between q and each document d of the database. S_{q,d} can be described as

    S_{q,d} = Σ_{t ∈ q∩d} w_{d,t} × w_{q,t}

where the values w_{d,t} and w_{q,t}, called term impacts or simply impacts[2], represent the degree of "importance" of term t in document d and query q respectively, and are calculated from f_{d,t}, f_{q,t}, f_t, W_d, and W_q. It should be noted that we employ a common notation for document impacts and query impacts (that is, the impact values of document terms and query terms respectively) for simplicity, and that they can in fact have different formulations in terms of the underlying values f_{d,t}, f_{q,t}, f_t, W_d, and W_q.
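A minimal sketch of the ranking formula above, assuming one common TF-IDF style instantiation of the impacts w_{d,t} and w_{q,t} (the text deliberately leaves their exact formulation open):

import math

def vsm_score(query_tf, doc_tf, doc_freq, n_docs):
    # S_{q,d} = sum over terms t in both q and d of w_{d,t} * w_{q,t},
    # here with the TF-IDF style impact w = tf * log(N / f_t).
    score = 0.0
    for t in query_tf.keys() & doc_tf.keys():
        idf = math.log(n_docs / doc_freq[t])
        score += (doc_tf[t] * idf) * (query_tf[t] * idf)
    return score

score = vsm_score(query_tf={"network": 1, "recommend": 2},
                  doc_tf={"network": 5, "paper": 3, "recommend": 1},
                  doc_freq={"network": 120, "paper": 300, "recommend": 40},
                  n_docs=10000)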
4.2. Co-occurrence based thesaurus

Terms used in documents differ from one sentence to another, and the meanings of a term differ depending on the situation in which the term is used. This difference is a characteristic of the source and is used for selection. The terms used in each source are distinguished by the words that occur in the source and their frequencies of occurrence. However, methods using only statistical data face problems caused by polysemous words. In this thesaurus-based method, the meanings of a term are distinguished by the relationships between the term and other terms:

    sim(X, Y) = (1/n) · Σ_{i=1}^{n} x_i y_i / ( √(Σ_{i=1}^{n} x_i²) · √(Σ_{i=1}^{n} y_i²) )

where n is the number of terms that appear in X and Y, and x_i and y_i are the elements of the ith row of the square matrix constructed from a document X and a document Y.

4.3. IRM

IRM[16] denotes the unconditional probability of a frequent term g ∈ G as the expected probability p_g, and the total number of co-occurrences of a term w_i with the frequent terms G as f_G(w_i). The frequency of co-occurrence of term w_i and a term g ∈ G is written as freq(w_i, g). The χ² statistic is defined as follows (the subscript j represents "in document j"):

    χ²_{ij} = Σ_{g∈G} (freq(w_{ij}, g) − f_G(w_{ij}) p_g)² / (f_G(w_{ij}) p_g)

If χ²(w) > χ²_α, the null hypothesis is rejected with significance level α (χ²_α is normally obtained from statistical tables, or by integral calculation). The term f_G(w_{ij}) p_g represents the expected frequency of co-occurrence, and (freq(w_{ij}, g) − f_G(w_{ij}) p_g) represents the difference between the expected and observed frequencies. Therefore, a large χ²_{ij} indicates that the co-occurrence of term w_i is strongly biased. IRM uses the χ² measure as an index of bias, not for tests of hypotheses.
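The χ² bias measure can be sketched as follows (hypothetical Python; the expected probabilities p_g, the total f_G, and the observed counts are illustrative inputs, supplied exactly as the definitions above require):

def chi_square_bias(freq_with_g, f_G, p):
    # IRM's statistic: sum over frequent terms g of
    # (freq(w, g) - f_G(w) * p_g)^2 / (f_G(w) * p_g).
    return sum((freq_with_g[g] - f_G * p[g]) ** 2 / (f_G * p[g]) for g in p)

p = {"network": 0.5, "word": 0.3, "model": 0.2}    # expected probabilities p_g
observed = {"network": 12, "word": 2, "model": 1}  # observed co-occurrences
bias = chi_square_bias(observed, f_G=15, p=p)
# A large value means the candidate term co-occurs with the frequent
# terms in a strongly biased way, so IRM treats it as significant.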
4.4. Experimental Result

[Figure 6. An Experimental Result (precision of recommendation at six, four, and two months ago for VSM, Thesaurus, IRM, Our method (Eq. 6), and Our method (Eq. 7))]

Figure 6 shows the precision of recommendation correctness for each method. The horizontal axis of Figure 6 shows the checked point in time and the vertical axis shows the precision of recommendation. The precision of the existing methods, VSM, Thesaurus, and IRM, and of Equation 6 of our method keeps steady or moves downward. However, the precision of Equation 7 of our method moves upward.

Table 2 shows how effectively each method can resolve characteristic (2) mentioned in Section 1.

                     six months ago   four months ago   two months ago
VSM                  0.15             0.14              0.10
Thesaurus            0.26             0.22              0.20
IRM                  0.26             0.23              0.19
Our Method (Eq. 6)   0.35             0.33              0.32
Our Method (Eq. 7)   0.46             0.47              0.49

Table 2. Precision from the characteristic (2) point of view

From the characteristic (2) point of view in Table 2, the precision of the other methods and of Equation 6 of our method moves downward. The other methods cannot resolve characteristic (2), i.e., whether or not recommended papers are novel. However, the precision of Equation 7 of our method moves upward. Namely, the topic model mentioned in Section 2.3 is effective in resolving characteristic (2), whether or not recommended papers are novel.

Table 3 shows how effectively each method can resolve characteristic (3) mentioned in Section 1.

                     six months ago   four months ago   two months ago
VSM                  0.13             0.11              0.10
Thesaurus            0.24             0.22              0.22
IRM                  0.33             0.31              0.30
Our Method (Eq. 6)   0.33             0.33              0.34
Our Method (Eq. 7)   0.50             0.52              0.52

Table 3. Precision from the characteristic (3) point of view

From the characteristic (3) point of view in Table 3, the precision of all the other methods moves downward. The other methods cannot resolve characteristic (3), i.e., whether or not recommended papers are of interest at the present moment. However, the precision of Equations 6 and 7 holds steady or moves upward, so they are effective in resolving characteristic (3), whether or not recommended papers are of interest at the present moment.

These experimental results show that the recommendation correctness of our method is higher than that of the other methods: VSM, the co-occurrence-based thesaurus, and IRM.

5. Related Works

Information recommendation is helpful in reducing the noise in a document and preventing information overload. Several methods of information filtering have been reported. Miura[17] proposed an adaptive Web which dynamically changes information content and presentation according
                      six months ago   four months ago   two months ago
   VSM                     0.15              0.14              0.10
   Thesaurus               0.26              0.22              0.20
   IRM                     0.26              0.23              0.19
   Our Method (Eq. 6)      0.35              0.33              0.32
   Our Method (Eq. 7)      0.46              0.47              0.49

   Table 2. Precision from the characteristic (2) point of view

                      six months ago   four months ago   two months ago
   VSM                     0.13              0.11              0.10
   Thesaurus               0.24              0.22              0.22
   IRM                     0.33              0.31              0.30
   Our Method (Eq. 6)      0.33              0.33              0.34
   Our Method (Eq. 7)      0.50              0.52              0.52

   Table 3. Precision from the characteristic (3) point of view

Hamasaki[12] proposed a recommender system, kMedia, which is based on users' interests as indicated by their web browser bookmarks. Chen[5] proposed a proxy-based recommendation system that uses an artificial life and TF mechanism. In contrast, we determine a user's interests and specialties from their paper viewing history using the scale-free network. As a conventional method, Matsuo[16] proposed IRM, which takes advantage of a user's web browsing history to support web browsing. However, IRM treats all of the browsing history equally and is therefore unable to represent a user's long-range interests. Our method utilizes the user's paper viewing history and represents the user's long-range and short-range interests as word frequency and word recency. Matsumura[15] proposed PAI, which uses a spreading activation model to extract keywords from a single document without a corpus, thesaurus, syntactic analysis, dependency relations between terms, or any other knowledge except stopword lists. In contrast, our method uses the viewing history of multiple papers to build a user's model.

6. Conclusion and Future Work

We proposed Papits, a research support system for the effective recommendation of research documents. Our method uses a user's paper viewing history to identify their interests. The paper recommendation agent can recommend papers that may be of interest by using a user's paper viewing history and the topic model, which is based on the topic frequency and the topic recency.

We constructed a user's model based on the scale-free network built from a user's paper viewing history in order to capture the user's long-range and short-range interests. We represent a user's long-range interests as the frequency of the network, and a user's short-range interests as the fitness of the network. We also constructed the topic model from papers in the Papits database. The topic model has two elements: the topic frequency, which is based on word co-occurrence, and the topic recency, which is based on the Jaccard coefficient.
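The precise definitions are given by the paper's own equations (Eqs. 6 and 7) in the earlier sections; as a rough illustration only, the following Python sketch shows how the two elements of the topic model might be approximated. The representation of papers as word sets, the function names, and the averaging scheme are our assumptions, not the paper's definitions.

    from itertools import combinations

    def jaccard(a, b):
        # Jaccard coefficient |A & B| / |A | B| between two word sets.
        a, b = set(a), set(b)
        union = a | b
        return len(a & b) / len(union) if union else 0.0

    def topic_frequency(papers, topic_words):
        # Rough co-occurrence-based frequency: count pairs of topic
        # words that appear together in the same paper.
        topic_words = set(topic_words)
        count = 0
        for words in papers:  # each paper is a set of words
            present = topic_words & set(words)
            count += sum(1 for _ in combinations(present, 2))
        return count

    def topic_recency(recent_papers, topic_words):
        # Rough Jaccard-based recency: average similarity between the
        # topic's word set and the user's recently viewed papers.
        if not recent_papers:
            return 0.0
        return (sum(jaccard(p, topic_words) for p in recent_papers)
                / len(recent_papers))

    papers = [{"agent", "recommendation", "network"},
              {"network", "scale-free", "fitness"}]
    topic = {"network", "recommendation"}
    print(topic_frequency(papers, topic))      # -> 1
    print(topic_recency(papers[-1:], topic))   # -> 0.25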
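The user-model side, a word co-occurrence network grown from the viewing history, can be sketched in the same illustrative spirit, with long-range interest read off as accumulated word frequency and short-range interest as a fitness value that favours recent viewings. The class below is a simplified stand-in of ours; in particular, the exponential decay used for the fitness is an assumption, not the paper's definition.

    from collections import defaultdict
    from itertools import combinations

    class UserModel:
        # Illustrative word co-occurrence network built from a user's
        # paper viewing history (not the paper's actual equations).
        def __init__(self, decay=0.9):
            self.decay = decay                   # assumed fading factor
            self.frequency = defaultdict(float)  # long-range interest
            self.fitness = defaultdict(float)    # short-range interest
            self.edges = defaultdict(int)        # co-occurrence counts

        def view_paper(self, words):
            # Record one viewed paper, given as a set of words.
            for w in self.fitness:               # older viewings fade
                self.fitness[w] *= self.decay
            for w in words:
                self.frequency[w] += 1
                self.fitness[w] += 1
            for w1, w2 in combinations(sorted(set(words)), 2):
                self.edges[(w1, w2)] += 1        # grow the network

    model = UserModel()
    model.view_paper({"scale-free", "network", "recommendation"})
    model.view_paper({"jaccard", "recency", "recommendation"})
    # "recommendation" now has the largest frequency and fitness.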
One of the main problems in recommendation is how to reduce information overload and realize precise and accurate recommendations. To solve this problem, we use a recommendation mechanism in Papits. Our recommendation mechanism uses the user's model and the topic model in order to address the characteristics mentioned in Section 1. Conventional recommendation mechanisms mainly deal with characteristic (1), the importance of papers, for example by using a statistical approach. To deal with characteristics (2) and (3), we additionally utilized the user's paper viewing history and the topic model. This allowed us to check whether or not a paper is novel. Moreover, this monitoring enabled us to determine a user's preferences and interests and to check whether or not a paper is of interest at the present moment. Papits thus reduces information overload and realizes a precise and accurate recommendation. We showed the effectiveness of our method compared with other methods.

References

[1] R. Albert and A.-L. Barabasi. Statistical mechanics of complex networks. Reviews of Modern Physics, 2002.
[2] V. N. Anh, O. de Kretser, and A. Moffat. Vector-space ranking with effective early termination. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2001.
[3] A.-L. Barabasi, R. Albert, and H. Jeong. Mean-field theory for scale-free random networks. Physica A, 272:173-187, 1999.
[4] A.-L. Barabasi and R. Albert. Emergence of scaling in random networks. Science, 286:509-512, 1999.
[5] C. C. Chen, M. Chen, and Y. Sun. A web document personalization user model and system. In Proceedings of the 8th International Conference on User Modeling, 2001.
[6] S. N. Dorogovtsev and J. F. F. Mendes. Language as an evolving word web. In Proceedings of The Royal Society of London, Series B, Biological Science, volume 268, pages 2603-2606, 2001.
[7] R. Ferrer i Cancho and R. Sole. The small world of human language. In Proceedings of the Royal Society of London, pages 2261-2266, 2001.
[8] N. Fujimaki, T. Ozono, and T. Shintani. Flexible query modifier for research support system Papits. In Proceedings of the IASTED International Conference on Artificial and Computational Intelligence (ACI2002), pages 142-147, 2002.
[9] N. Fukuta, T. Ito, and T. Shintani. Logic-based framework for mobile intelligent information agents. In Proceedings of the 10th International World Wide Web Conference, pages 58-59, 2001.
[10] G. Bianconi and A.-L. Barabasi. Competition and multiscaling in evolving networks. Europhysics Letters, 54(4):436-442, 2001.
[11] S. Goto, T. Ozono, and T. Shintani. A method for information source selection using thesaurus for distributed information retrieval. In Proceedings of PAIS2001, pages 272-277, 2001.
[12] M. Hamasaki and H. Takeda. Experimental results for a method to discover human relationships based on WWW bookmarks. In Proceedings of the Fifth International Conference on Knowledge-Based Intelligent Information Engineering Systems & Allied Technologies (KES-2001), volume 2, pages 1291-1295, 2001.
[13] T. Hasegawa, T. Ozono, T. Ito, and T. Shintani. A feature selection for text categorization on research support system Papits. In Proceedings of the 8th Pacific Rim International Conference on Artificial Intelligence (PRICAI-04), 2004.
[14] H. P. Luhn. A statistical approach to the mechanized encoding and searching of literary information. IBM Journal of Research and Development, 1(4):309-317, 1957.
[15] N. Matsumura, Y. Ohsawa, and M. Ishizuka. PAI: Automatic indexing for extracting asserted keywords from a document. New Generation Computing (Springer-Verlag and Ohmsha), 21(1):37-47, 2002.
[16] Y. Matsuo, H. Fukuta, and M. Ishizuka. Browsing support by highlighting keywords based on a user's browsing history. In IEEE SMC-2002, 2002.
[17] N. Miura, K. Takahashi, and K. Shima. A user-model construction method for personal-adaptive WWW (special issue on next generation human interface and interaction). Transactions of Information Processing Society of Japan, 39(5), 1999.
[18] T. Ozono, S. Goto, N. Fujimaki, and T. Shintani. P2P-based knowledge source discovery on research support system Papits. In The First International Joint Conference on Autonomous Agents & Multiagent Systems (AAMAS2002), 2002.
[19] M. Porter. An algorithm for suffix stripping. Automated Library and Information Systems, 14(3):130-137, 1980.
[20] G. Salton. Automatic text processing. Addison-Wesley, MA, 1989.
[21] G. Salton and M. McGill. Introduction to modern information retrieval. McGraw-Hill, 1983.
[22] M. Sugimoto. User modeling and adaptive interaction in information gathering systems. Journal of Japanese Society for Artificial Intelligence, 14(1), 1999.