Using WordNet to Improve User Modelling in a Web Document Recommender System

Bernardo Magnini and Carlo Strapparava
ITC-irst, Istituto per la Ricerca Scientifica e Tecnologica, I-38050 Trento, ITALY
email: {magnini, strappa}@irst.itc.it

Abstract

We propose to use WordNet in the context of a news recommendation system on the web. Documents passed over are processed and the relevant senses are extracted to build a semantic network, which is used to dynamically predict new documents. As for disambiguation, we use word domain disambiguation, a technique that relies on domain labels associated to WordNet synsets. We also report the results of an experiment that has been carried out to give a quantitative estimation of the use of such a content-based user model.

1 Introduction

Despite its popularity across the computational linguistics community, WordNet (Miller, 1990), like many other lexical resources, is still scarcely used in real NLP applications. One reason for this is that the granularity of its sense distinctions makes word sense disambiguation hard. The problem is being addressed following two converging directions. From the resource side, WordNet can be extended, for instance by adding information that clusters similar senses. From the application side, it is important to select scenarios where the loss of sense granularity is not crucial and the benefits of a sense-based approach are remarkable. This paper explores both directions. We make use of WordNet Domains (Magnini and Cavaglia, 2000), an extension of WordNet where synsets are clustered by means of domain labels. As for the scenario, we propose to improve SiteIF (Stefani and Strapparava, 1998; Strapparava et al., 2000), an existing document recommendation system, by introducing a sense-based analysis of the documents.

SiteIF is a personal agent for a multilingual news web site that takes into account the user's browsing by "watching over the user's shoulder". It learns the user's interests from the requested pages, which are analyzed to generate or to update a model of the user. Exploiting this model, the system tries to anticipate which documents in the web site could be interesting for the user.

Many systems (e.g. (Lieberman et al., 1999; Armstrong et al., 1995; Minio and Tasso, 1996)) that exploit a user model to propose relevant documents build a representation of the user's interests which takes into account some properties of the words in a document, such as their frequency and their co-occurrence. However, assuming that interest is strictly related to the semantic content of the already seen documents, a purely word-based user model is often not accurate enough. The issue is even more important in the Web world, where documents deal with many different topics and the chance of misinterpreting word senses is a real problem.

In this paper we propose to use a content-based document representation as a starting point to build a model of the user's interests. As the user browses the documents, the system builds the user model as a semantic network whose nodes represent senses (not just words) of the documents requested by the user. Then, the filtering phase takes advantage of the word senses to retrieve new documents with high semantic relevance with respect to the user model. The use of senses rather than words implies that the resulting user model is not only more accurate but also independent from the language of the documents browsed.
This is particularly important for multilingual web sites, which are becoming very common, especially among news sites and in electronic commerce domains. The paper also describes an empirical evaluation of content-based versus traditional word-based user modelling. This experiment shows a substantial improvement in performance with respect to the word-based approach.

The paper is organized as follows. Section 2 gives a sketch of the kind of documents the system deals with and describes how MultiWordNet and the disambiguation algorithms can be exploited to represent the documents in terms of lexical concepts. Section 3 describes how the user model is built, maintained and used to propose new relevant documents to the user. Section 4 gives an account of the experiment that evaluates and compares a synset-based user model versus a word-based user model. Some final comments about future developments conclude the paper.
CULTURE: GIOTTO PAID BY MONKS TO WRITE ANTI-FRANCISCAN POETRY
Rome, 10 Jan. - (Adnkronos) - Giotto was 'paid' to attack a faction of the Franciscans, the Spiritual ones, who opposed church decoration in honour of the Poverello di Assisi. This has been revealed in the research of an Italian scholar who is a professor at Yale University, Stefano Ugo Baldassarri, who thinks he has solved the mystery of the only known poetry by the famous Tuscan painter: the Giotto verses have in fact always provoked wonder because they seem to be a criticism of the ideals of St. Francis, and all the more so since their author was also the man who painted the famous frescoes of the Basilica at Assisi. ...

CULTURA: GIOTTO PAGATO DA FRATI PER SCRIVERE POESIA ANTI-FRANCESCANA
Roma, 10 gen. - (Adnkronos) - Giotto fu 'pagato' per attaccare una fazione dei Francescani, quella degli Spirituali, che si opponevano alla decorazione delle chiese in onore del Poverello di Assisi. Lo rivela una ricerca di uno studioso italiano docente alla Yale University, Stefano Ugo Baldassarri, che ritiene di aver svelato il mistero dell'unica poesia conosciuta del celebre pittore toscano: i versi giotteschi, infatti, avevano sempre destato meraviglia perché apparivano come una critica agli ideali di San Francesco, tanto più mossa proprio dall'autore dei celebri affreschi della Basilica di Assisi. ...

Figure 1: Sample of parallel news texts

2 Content-Based Document Representation

The SiteIF web site has been built using a news corpus kindly put at our disposal by AdnKronos, an important Italian news provider. The corpus consists of about 5000 parallel news items (i.e. each news item has both an Italian and an English version), partitioned by AdnKronos into a number of fixed categories: culture, food, holidays, medicine, fashion, motors and news. The average length of a news item is about 265 words. Figure 1 shows an example of parallel (English-Italian) news.

The main working hypothesis underlying our approach to user modelling is that a content-based analysis of the document can improve the accuracy of the model. There are two crucial questions to address: first, a repository for word senses has to be identified; second, the problem of word sense disambiguation, with respect to the sense repository, has to be solved.

As the sense repository we have adopted MultiWordNet (Artale et al., 1997), a multilingual extension of the English WordNet. The Italian part of MultiWordNet currently covers about 35,000 lemmas, completely aligned with the English WordNet 1.6 (i.e. with correspondences to English senses).

As far as word disambiguation is concerned, we have addressed the problem starting from the hypothesis that many sense distinctions are not relevant for a document representation intended for user modelling. This line is also supported by several works (see for example (Gonzalo et al., 1998)) which remark that for many practical purposes (e.g. cross-lingual information retrieval) the fine-grained sense distinctions provided by WordNet are not necessary. To reduce WordNet polysemy and, as a consequence, the complexity of word sense disambiguation, we have used Word Domain Disambiguation (WDD), a technique proposed in (Magnini and Strapparava, 2000) based on sense clustering through the annotation of the MultiWordNet synsets with domain labels. Section 2.1 gives some details about WDD, while Section 2.2 shows how WDD is applied to represent documents in our context.
2.1 Word Domain Disambiguation

Word Domain Disambiguation is a variant of Word Sense Disambiguation where for each word in a text a domain label (among those allowed by the word) has to be chosen instead of a sense label. Domain labels, such as Medicine and Architecture, provide a natural way to establish semantic relations among word senses, grouping them into homogeneous clusters. Figure 2 shows an example. The word "book" has seven different senses in WordNet 1.6: three of them can be grouped under the Publishing domain, reducing the polysemy from 7 to 5 senses. In MultiWordNet the synsets have been annotated with one or more domain labels, selected from a set of about two hundred labels hierarchically organized (see (Magnini and Cavaglia, 2000) for the annotation methodology and for the evaluation of the resource).
"book"
  {book#1 - published composition}                                        Publishing
  {book#2, volume#3 - book as a physical object}                          Publishing
  {daybook#2, book#7, ledger#2 - an accounting book as a physical object} Publishing
  {book#6 - book of the Bible}                                            Religion
  {script#1, book#4, playscript#1 - written version of a play}            Theater
  {account_book#1, book#5, ledger#1 - records of commercial accounts}     Commerce
  {record#5, record_book#1, book#3 - compilation of known facts
   regarding something or someone}                                        Factotum

Figure 2: An example of polysemy reduction

This resource currently covers all the noun synsets, and it is under development for the remaining lexical categories. For the purposes of a recommender system we have considered 42 disjoint labels, which allow a good level of abstraction without losing relevant information (e.g. in the experiments we have used Sport in place of Volley or Basketball, which are subsumed by Sport).

The domain disambiguation algorithm follows two steps. First, each word in the text is considered, and each domain label allowed by that word is given a score, determined by the frequency of the label among the senses of the word. In the second step each word is reconsidered, and the domain label with the highest score is selected as the result of the disambiguation. In (Magnini and Strapparava, 2000) it is reported that this algorithm reaches .83 and .85 accuracy in word domain disambiguation, respectively for Italian and English, on a corpus of parallel news. This result makes WDD appealing for applications where fine-grained sense distinctions are not required, such as user modelling from documents.
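The following minimal Python sketch illustrates the two steps on a toy sense inventory; the SENSES table is a hypothetical stand-in for the MultiWordNet annotation, and the use of the text-level totals in step two is our own simplifying assumption rather than the exact procedure of (Magnini and Strapparava, 2000). It also shows how the winning domain then selects the synsets kept for the document representation of Section 2.2 below.

from collections import Counter

# SENSES is a hypothetical stand-in for the MultiWordNet annotation:
# for each noun lemma, its synsets paired with their domain label.
SENSES = {
    "church":  [("church#1", "Religion"), ("church#2", "Religion"),
                ("church#3", "Religion")],
    "painter": [("painter#1", "Art"), ("painter#2", "Factotum"),
                ("painter#3", "Zoology")],
    "fresco":  [("fresco#1", "Art"), ("fresco#2", "Art")],
}

def wdd(nouns):
    """Two-step WDD. Step 1: score each label allowed by a word by its
    frequency among the word's senses, accumulating a text-level total.
    Step 2: reconsider each word and keep its highest-scoring label,
    here weighted by the text-level total as a crude stand-in for
    'the domain label appropriate for the word context' (assumption)."""
    word_scores, text_scores = {}, Counter()
    for w in nouns:                                   # step 1
        labels = Counter(label for _, label in SENSES[w])
        total = sum(labels.values())
        word_scores[w] = {l: n / total for l, n in labels.items()}
        text_scores.update(word_scores[w])
    return {w: max(word_scores[w],                    # step 2
                   key=lambda l: word_scores[w][l] * text_scores[l])
            for w in nouns}

def document_representation(nouns):
    """Keep only the synsets tagged with each word's chosen domain:
    the Synset Document Representation (SDR) of Section 2.2."""
    domains = wdd(nouns)
    return {w: [s for s, label in SENSES[w] if label == domains[w]]
            for w in nouns}

print(document_representation(["church", "painter", "fresco"]))
# painter keeps only painter#1: the Art domain prevails in this text.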
2.2 Document Representations

Each document maintained in the SiteIF site is processed to extract its semantic content. Given that we rely on MultiWordNet, the final representation consists of a list of synsets relevant to a certain document. The text processing is carried out whenever a new document is inserted in the web site, and includes two basic phases: (i) lemmatization and part-of-speech tagging; (ii) synset identification with WDD.

For lemmatization and part-of-speech tagging we use the LinguistX tools produced by InXight(TM), which allow texts to be processed in a number of languages, including English and Italian. During this phase the text is first tokenized (i.e. lexical units are identified), then for each word the possible lemmas as well as their morpho-syntactic features are collected. Finally, part-of-speech ambiguities are solved. This is the input for the synset identification phase, which is mainly based on the word domain disambiguation procedure described in Section 2.1. The WDD algorithm, for each word (currently just nouns are considered, due to the limited coverage of the domain annotation), proposes the domain label appropriate for the word context. Then, the word synsets associated with the proposed domain are selected and added to the document representation.

As an example, Figure 3 shows a fragment of the Synset Document Representation (SDR) for the document presented in Figure 1. Words are presented with the preferred domain label as well as with the selected synsets. For readability reasons we show the synonyms belonging to each synset in place of the synset unique identifier used in the actual implementation. In addition, only the English part of each synset is displayed.

3 Sense-Based User Modelling

In SiteIF the user model is implemented as a semantic net whose goal is to represent the contextual information derived from the documents. Previous versions of SiteIF were purely word-based, that is, the nodes in the net represented words and the arcs word co-occurrences. However, the resulting user models were fixed to the precise words of the browsed news. One key issue in automating the retrieval of potentially interesting news was to find document representations that are semantically rich and accurate, keeping the participation of the user to a minimal level.
Word lemma   Domain label   Synsets
faction      Factotum       {faction-2, sect-2}; {cabal-1, faction-1, junta-1, junto-1, camarilla-1}
franciscan   Religion       {Gray_Friar-1, Franciscan-1}
church       Religion       {church-1, Christian_church-1, Christianity-2}; {church-2, church_building-1}; {church_service-1, church-3}
decoration   Factotum       {decoration-3}
honour       Factotum       {award-2, accolade-1, honor-1, honour-2, laurels-1}; {honor-3, honour-4}
research     Factotum       {research-1}; {inquiry-1, enquiry-2, research-2}
scholar      Pedagogy       {scholar-1, scholarly_person-1, student-2}; {learner-1, scholar-2}; {scholar-3}
professor    Pedagogy       {professor-1}
mystery      Literature     {mystery-2, mystery_story-1, whodunit-1}
poetry       Literature     {poetry-1, poesy-1, verse-1}; {poetry-2}
painter      Art            {painter-1}
verse        Literature     {poetry-1, poesy-1, verse-1}; {verse-2, rhyme-2}; {verse-3, verse_line-1}
wonder       Factotum       {wonder-2, marvel-1}
criticism    Factotum       {criticism-1, unfavorable_judgment-1}
ideal        Factotum       {ideal-1}; {ideal-2}
man          Factotum       {man-1, adult_male-1}; {man-3}; {man-7}; {man-8}
author       Literature     {writer-1, author-1}
fresco       Art            {fresco-1}; {fresco-2}
basilica     Religion       {basilica-1}

Figure 3: Synset Document Representation for a fragment of text

A new version of SiteIF has been realized where the user model is still implemented as a network structure, with the difference that nodes now represent synsets and arcs the co-occurrence of synsets. The working hypothesis is that the model can help to define semantic chains through which the filtering has a better chance to catch documents semantically closer to the topics already touched by the user.

3.1 Modelling Phase

In the modelling phase SiteIF considers the documents browsed during a user navigation session. The system uses the document representation of the browsed news. Every synset has a score that is inversely proportional to its frequency over the whole news corpus. The score is higher for less frequent synsets, preventing very common meanings from becoming too dominant in the user model. Likewise, in the word-based case we considered a word-list document representation, where every word has a score inversely proportional to the word frequency in the news corpus.

The system builds or augments the user model as a semantic net whose nodes are synsets and whose arcs represent the co-occurrence relation (co-occurring presence in a document) of two synsets. Weights on nodes are incremented by the score of the synsets, while weights on arcs are the mean of the connected nodes' weights. For each browsed news item, the weights of the net are periodically reconsidered and possibly lowered, depending on the time passed since the last update. Also, no longer useful nodes and arcs may be removed from the net. In this way it is possible to track changes in the user's interests and to avoid uninteresting concepts remaining in the user model. Figure 4 sketches the modelling process, showing an example of user model augmentation.
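The update step just described can be sketched as follows in Python; the class layout and the concrete decay and pruning constants are illustrative assumptions, since the text above only states that weights are periodically lowered and that no longer useful nodes and arcs may be removed.

import itertools

class UserModel:
    """Sketch of the sense-based user model: a semantic net whose nodes
    are synsets and whose arcs record synset co-occurrence in a document.
    The decay and pruning constants are illustrative assumptions."""

    def __init__(self, corpus_freq, decay=0.9, prune_below=0.1):
        self.corpus_freq = corpus_freq   # synset -> frequency over the news corpus
        self.nodes = {}                  # synset -> weight
        self.arcs = {}                   # frozenset({i, j}) -> weight
        self.decay = decay
        self.prune_below = prune_below

    def score(self, synset):
        # Inversely proportional to corpus frequency, so that very
        # common meanings do not become too dominant in the model.
        return 1.0 / self.corpus_freq.get(synset, 1)

    def update(self, doc_synsets):
        """Augment the net with the synsets of one browsed document."""
        present = set(doc_synsets)
        for s in present:
            self.nodes[s] = self.nodes.get(s, 0.0) + self.score(s)
        for i, j in itertools.combinations(sorted(present), 2):
            # Arc weight: mean of the two connected node weights.
            self.arcs[frozenset((i, j))] = (self.nodes[i] + self.nodes[j]) / 2

    def age(self):
        """Periodically lower all weights and drop nodes and arcs that
        are no longer useful, so that stale interests fade away."""
        self.nodes = {s: w * self.decay for s, w in self.nodes.items()
                      if w * self.decay >= self.prune_below}
        self.arcs = {a: w * self.decay for a, w in self.arcs.items()
                     if a.issubset(self.nodes) and w * self.decay >= self.prune_below}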
3.2 Filtering Phase

During the filtering phase, the system compares any document in the site (i.e. its representation in terms of synsets) with the user model. A matching module receives as input the internal representation of a document and the current user model, and it produces as output a classification of the document (i.e. whether or not it is worth the user's attention). The relevance of any single document is estimated using the Semantic Network Value Technique (see (Stefani and Strapparava, 1998) for details).

The idea behind the SiteIF algorithm consists of checking, for every concept in the representation of the document, whether the context in which it occurs has already been found in previously visited documents (i.e. is already stored in the semantic net). This context is represented by a co-occurrence relationship, i.e. by the couples of terms included in the document which have already co-occurred before in other documents. This information is represented by the arcs of the semantic net.

Below we present the formula used to calculate the relevance of a document using the Semantic Network Value Technique:

Relevance(doc) = \sum_{i \in syns(doc)} w(i) \cdot freq_{doc}(i) + \sum_{i,j \in syns(doc)} w(i,j) \cdot w(j) \cdot freq_{doc}(j)

where w(i) is the weight of the synset node i in the UM network, w(i,j) is the weight of the arc between i and j, and freq_{doc}(i) is the frequency of synset i in the document.
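A minimal sketch of this computation over the UserModel of the previous section follows; the formula above does not state whether the couples (i, j) are ordered, so ranging over ordered pairs here is our assumption.

from collections import Counter
from itertools import permutations

def relevance(doc_synsets, um):
    """Semantic Network Value Technique, following the formula above.
    `um` is a UserModel from the previous sketch; synsets and arcs the
    model does not contain contribute weight 0."""
    freq = Counter(doc_synsets)
    syns = set(doc_synsets)
    node_term = sum(um.nodes.get(i, 0.0) * freq[i] for i in syns)
    arc_term = sum(um.arcs.get(frozenset((i, j)), 0.0)
                   * um.nodes.get(j, 0.0) * freq[j]
                   for i, j in permutations(syns, 2))
    return node_term + arc_term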
[Figure 4, a diagram: in the user modelling phase, SiteIF runs the WDD algorithm on the documents visited in a navigation session and updates the user model with the resulting synset lists (e.g. {magnetic disk, ...}, {software, ...}, {monetary system, ...}); in the filtering phase, it compares any site document against the updated model to compute its relevance.]

Figure 4: Modelling and Filtering Processes

See Figure 4 for a summary sketch of the filtering process.

4 Evaluation

We wanted to estimate how much the new version of SiteIF (synset-based) actually improves performance with respect to the previous version of the system (word-based). However, setting up a comparative test among user models that goes beyond generic user satisfaction is not straightforward. To evaluate whether and how the exploitation of the synset representation improves the accuracy of the semantic network modelling and filtering, we arranged an experiment whose goal was to compare the output of the two systems against the judgements of a human advisor.

We proceeded in the following way. First, a test set of about one hundred English news items from the AdnKronos corpus was selected, homogeneously with respect to the overall distribution of categories (i.e. culture, motors, etc.). The test set was made available as a Web site, and then 12 ITC-irst researchers were asked to browse the site, simulating a user visiting the news site. Users were instructed to select a news item according to their personal interests, to read it completely, and then to select another item, again according to their interests. This process was repeated until ten news items were picked out.

After this phase, a human advisor, who was acquainted with the test corpus, was asked to analyze the documents chosen by the users and to propose new potentially interesting documents from the corpus. The advisor was requested to follow the same procedure for each document set: documents were first grouped according to their AdnKronos category, and a new document was sought in the test corpus within that category. If a relevant document was found, it was added to the advisor proposals; otherwise no document for that category was proposed. Eventually, an additional document, outside the categories browsed by the user, could be added by the advisor. On average, the advisor proposed 3 documents for a user document set.

At this point we compared the advisor proposals with the results of the two systems. To simulate the advisor behavior (i.e. it is allowed that for a given category no proposal is selected), all the system documents whose relevance was less than a fixed difference (20%) from the best document were eliminated. After this selection, on average, the system proposed 10 documents for a user document set.

Standard figures for precision and recall have been calculated considering the matches among the advisor and the system documents. Precision is the ratio of recommended documents that are relevant, while recall is the ratio of relevant documents that are recommended. In terms of our experiment we have

precision = |H \cap S| / |S|   and   recall = |H \cap S| / |H|

where H is the set of the human advisor proposals and S is the set of the system proposals.

Table 1 shows the results of the evaluation.
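A small sketch of this selection step and of the two measures; reading "less than a fixed difference (20%) from the best document" as a relative margin on the relevance score is our assumption.

def shortlist(scored_docs, margin=0.20):
    """Keep only documents whose relevance is within a fixed margin of
    the best one, mimicking the advisor's option of proposing nothing
    for a category. scored_docs maps doc id -> relevance."""
    best = max(scored_docs.values())
    return {d for d, r in scored_docs.items() if r >= best * (1 - margin)}

def precision_recall(H, S):
    """H: advisor proposals, S: system proposals (sets of doc ids)."""
    hits = len(H & S)
    return hits / len(S), hits / len(H)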
The first pair of columns ("News") considers matches at the level of single news documents, the second ("Categories") only at the level of AdnKronos categories. We can note that precision considerably increases (34%) with the synset-based user model. This confirms the working hypothesis that substituting words with senses, both in the modelling and in the filtering phase, produces more accurate output. The main reason, as expected, is that a synset-based retrieval prefers documents with a high degree of semantic coherence, which is not guaranteed in the case of word-based retrieval.
                  News                  Categories
                  Precision   Recall   Precision   Recall
Word-Based UM     0.51        0.21     0.89        0.40
Synset-Based UM   0.85        0.36     0.97        0.43

Table 1: Comparison between word-based UM and synset-based UM

As for recall, it also gains some points (15%), even if it remains quite low. However, this does not seem a serious drawback for a pure recommender system, where there is no need to answer an explicit query (as happens, for instance, in information retrieval systems); rather, the need is for high quality (i.e. precision) of the proposals.

5 Conclusions

We have presented a new version of SiteIF, a recommender system for a Web site of multilingual news. Exploiting a content-based document representation, we have described a model of the user's interests based on word senses rather than simply on words. The main advantages of this approach are that semantic accuracy increases and that the model is independent from the language of the news.

To give a quantitative estimation of the improvements induced by a content-based approach, a comparative experiment - sense-based vs. word-based user model - has been carried out, which has shown a significantly higher precision in the system recommendations.

There are several areas for future development. One point is to improve the disambiguation algorithms which are at the basis of the document representation. A promising direction (proposed in (Magnini and Strapparava, 2000)) is to design specific algorithms which consider the synset intersection of parallel news. A second working direction concerns the possibility of developing clustering algorithms over the senses of the semantic network. For example, once the user model network is built, it could be useful to dynamically infer some homogeneous user interest areas. This would allow the recommended documents to be arranged in uniform dynamic groups.

References

R. Armstrong, D. Freitag, T. Joachims, and T. Mitchell. 1995. WebWatcher: A learning apprentice for the world wide web. In Proc. of AAAI Spring Symposium on Information Gathering from Heterogeneous and Distributed Environments, Stanford, March.

A. Artale, B. Magnini, and C. Strapparava. 1997. WordNet for Italian and its use for lexical discrimination. In AI*IA97: Advances in Artificial Intelligence. Springer Verlag.

J. Gonzalo, F. Verdejo, C. Peters, and N. Calzolari. 1998. Applying EuroWordNet to cross-language text retrieval. Computers and the Humanities, 32(2-3):185-207.

H. Lieberman, N. W. Van Dyke, and A. S. Vivacqua. 1999. Let's browse: A collaborative web browsing agent. In Proceedings of the 1999 International Conference on Intelligent User Interfaces, Collaborative Filtering and Collaborative Interfaces, pages 65-68.

B. Magnini and G. Cavaglia. 2000. Integrating subject field codes into WordNet. In Proceedings of LREC-2000, Second International Conference on Language Resources and Evaluation, Athens, Greece, June.

B. Magnini and C. Strapparava. 2000. Experiments in word domain disambiguation for parallel texts. In Proc. of SIGLEX Workshop on Word Senses and Multi-linguality, Hong Kong, October. Held in conjunction with ACL-2000.

G. Miller. 1990. An on-line lexical database. International Journal of Lexicography, 3(4):235-312.

M. Minio and C. Tasso. 1996. User modeling for information filtering on internet services: Exploiting an extended version of the UMT shell. In Proc.
of Workshop on User Modeling for Information Filtering on the World Wide Web, Kailua-Kona, Hawaii, January. Held in conjunction with UM'96.

A. Stefani and C. Strapparava. 1998. Personalizing access to web sites: The SiteIF project. In Proc. of Second Workshop on Adaptive Hypertext and Hypermedia, Pittsburgh, June. Held in conjunction with HYPERTEXT'98.

C. Strapparava, B. Magnini, and A. Stefani. 2000. Sense-based user modelling for web sites. In Adaptive Hypermedia and Adaptive Web-Based Systems, Lecture Notes in Computer Science 1892. Springer Verlag.