"book"
  {book#1 - published composition}                                              PUBLISHING
  {book#2 volume#3 - book as a physical object}                                 PUBLISHING
  {daybook#2 book#7 ledger#2 - an accounting book as a physical object}         COMMERCE
  {book#6 - book of the Bible}                                                  RELIGION
  {script#1 book#4 playscript#1 - written version of a play}                    THEATER
  {account_book#1 book#5 ledger#1 - records of commercial account}              COMMERCE
  {record#5 record_book#1 book#3 - compilation of known facts regarding something or someone}  FACTOTUM

Figure 2: An example of polysemy reduction

the resource). This resource currently covers all the noun synsets, and it is under development for the remaining lexical categories. For the purposes of a recommender system we have considered 42 disjoint labels, which allow a good level of abstraction without losing relevant information (e.g. in the experiments we have used Sport in place of Volley or Basketball, which are subsumed by Sport).

The domain disambiguation algorithm follows two steps. First, each word in the text is considered, and for each domain label allowed by that word a score is given. This score is determined by the frequency of the label among the senses of the word. At the second step each word is reconsidered, and the domain label with the highest score is selected as the result of the disambiguation. In (Magnini and Strapparava, 2000) it is reported that this algorithm reaches .83 and .85 accuracy in word domain disambiguation, respectively for Italian and English, on a corpus of parallel news. This result makes WDD appealing for applications where fine-grained sense distinctions are not required, such as document user modelling.
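The two-step scoring just described can be sketched as follows. The sense inventory is a toy stand-in for the WordNet Domains annotation, and all names (`SENSE_DOMAINS`, `domain_scores`, `disambiguate`) are illustrative, not part of the actual system:

```python
from collections import Counter

# Toy stand-in for the domain annotation of the lexical resource:
# word -> one domain label per sense (illustrative data only).
SENSE_DOMAINS = {
    "play": ["THEATER", "THEATER", "SPORT", "MUSIC"],
    "bank": ["ECONOMY", "ECONOMY", "ECONOMY", "GEOGRAPHY"],
}

def domain_scores(word):
    """Step 1: give each domain label allowed by the word a score,
    determined by the frequency of the label among its senses."""
    labels = SENSE_DOMAINS.get(word, [])
    counts = Counter(labels)
    return {label: n / len(labels) for label, n in counts.items()}

def disambiguate(word):
    """Step 2: reconsider the word and select the domain label
    with the highest score."""
    scores = domain_scores(word)
    return max(scores, key=scores.get) if scores else None

print(disambiguate("play"))   # THEATER (2 of 4 senses)
print(disambiguate("bank"))   # ECONOMY (3 of 4 senses)
```

Note that the published algorithm proposes a label appropriate to the word's context; this toy version only captures the frequency-based scoring the text spells out.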
2.2 Document Representations

Each document maintained in the SiteIF site is processed to extract its semantic content. Given that we rely on MultiWordNet, the final representation consists of a list of synsets relevant for a certain document. The text processing is carried out whenever a new document is inserted in the web site, and includes two basic phases: (i) lemmatization and part-of-speech tagging; (ii) synset identification with WDD.

As for lemmatization and part-of-speech tagging we use the LinguistX tools produced by InXight™, which allow texts to be processed in a number of languages, including English and Italian. During this phase the text is first tokenized (i.e. lexical units are identified), then for each word the possible lemmas as well as their morpho-syntactic features are collected. Finally, part-of-speech ambiguities are resolved. This is the input for the synset identification phase, which is mainly based on the word domain disambiguation procedure described in Section 2.1. The WDD algorithm proposes, for each word (currently just nouns are considered, due to the limited coverage of the domain annotation), the domain label appropriate for the word context. Then, the word synsets associated with the proposed domain are selected and added to the document representation. As an example, Figure 3 shows a fragment of the Synset Document Representation (SDR) for the document presented in Figure 1. Words are presented with the preferred domain label as well as with the selected synsets. For readability reasons we show the synonyms belonging to each synset in place of the synset unique identifier used in the actual implementation. In addition, only the English part of the synset is displayed.

3 Sense-Based User Modelling

In SiteIF the user model is implemented as a semantic net whose goal is to represent the contextual information derived from the documents.
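The two phases above can be sketched as a minimal pipeline. The lexicon here is a toy stand-in for MultiWordNet, `lemmatize` stands in for the LinguistX tools, and the WDD procedure is passed in as a callable; every name and data item is illustrative:

```python
# Toy lexicon standing in for MultiWordNet:
# lemma -> list of (synset id, domain label) pairs.
SYNSETS = {
    "book": [("book#1", "PUBLISHING"), ("book#6", "RELIGION"),
             ("ledger#1", "COMMERCE")],
    "script": [("script#1", "THEATER")],
}

def lemmatize(text):
    """Phase (i), reduced to punctuation stripping and lower-casing;
    a real system would run morphological analysis and POS tagging."""
    return [tok.strip(".,;").lower() for tok in text.split()]

def build_sdr(text, wdd):
    """Phase (ii): for each lemma found in the lexicon, ask the WDD
    procedure for a domain label and keep only the synsets that carry
    that label, accumulating the Synset Document Representation."""
    sdr = []
    for lemma in lemmatize(text):
        senses = SYNSETS.get(lemma)
        if not senses:
            continue                      # word not covered by the lexicon
        domain = wdd(lemma)               # domain proposed for this context
        sdr.extend(syn for syn, dom in senses if dom == domain)
    return sdr

# Trivial stand-in WDD: THEATER for "script", PUBLISHING otherwise.
sdr = build_sdr("The script of the book.",
                lambda w: "THEATER" if w == "script" else "PUBLISHING")
print(sdr)  # ['script#1', 'book#1']
```

Passing the disambiguator as a parameter mirrors the paper's modular design, where the document representation phase is built on top of the WDD procedure of Section 2.1.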
Previous versions of SiteIF were purely word-based, that is, the nodes in the net represented the words and the arcs the word co-occurrences. However, the resulting user models were fixed to the precise words of the browsed news. One key issue in automating the retrieval of potentially interesting news was to find document representations that are semantically rich and accurate, keeping to a minimal level the partic-