T.-K. Fan, C.-H. Chang / Expert Systems with Applications 38 (2011) 1777–1788

Algorithm: Ad assignment strategies
Input: a blog post P, a profile set, an ads collection
Output: blogger-centric ads BCA that are related to intentions or personal interests
1.  Positive Intention (PI) Recognition
2.  Target(s) T = extract target(s) from PI   // where target(s) are depicted as noun(s)
3.  if T = ∅
4.     then Positive Sentiment (PS) Detection
5.          T = extract target(s) from PS
6.          if T = ∅
7.             then T = target(s) from I
8.  end if
9.  T is expanded by Term Expansion
10. Assigning ads which are related to T

Fig. 3. Pseudo code of ads assignment strategies.
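The control flow of Fig. 3 can be sketched as follows. All helper functions (`recognize_intention`, `detect_sentiment`, `extract_targets`, `expand_terms`, `match_ads`) are hypothetical stand-ins: the paper specifies only the steps, not their implementations, so they are passed in by the caller.

```python
# Sketch of the Fig. 3 ad-assignment control flow. The helpers are
# hypothetical stand-ins supplied by the caller, not the authors' code.
def assign_ads(post, ads, recognize_intention, detect_sentiment,
               extract_targets, expand_terms, match_ads):
    """Return blogger-centric ads related to the post's targets."""
    pi = recognize_intention(post)        # 1. Positive Intention Recognition
    targets = extract_targets(pi)         # 2. targets are depicted as nouns
    if not targets:                       # 3-5. fall back to sentiment
        ps = detect_sentiment(post)
        targets = extract_targets(ps)
    if not targets:                       # 6-8. fall back to the post itself
        targets = extract_targets(post)
    targets = expand_terms(targets)       # 9. Term Expansion
    return match_ads(ads, targets)        # 10. assign related ads
```

The two `if` blocks mirror the cascading fallback of steps 3–8: sentiment-derived targets are tried only when no intentional targets exist, and interest-derived targets only after that.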
We developed a technique to semi-automatically label training data for this task. Since many auction web sites (e.g., ebay.com and yahoo.com) provide special forums for buyers to post their needs, they naturally become a source of intention-filled sentences. Fig. 4 shows an example of such a post. In this study, we collected a large set of posts containing buyers' requirements from ebay.com.4 For each buyer's post BP, we extracted the content of the Description field by a simple program mainly coded with regular expressions.
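This extraction step can be sketched roughly as below. The actual page markup and pattern used by the authors are not given, so the regular expression here is purely illustrative and assumes a plain-text post with a "Description" label.

```python
import re

# Illustrative only: the paper says the Description field was extracted by
# "a simple program mainly coded with regular expressions", but does not
# give the pattern. This assumes a plain-text post with a Description label.
DESC_RE = re.compile(r"Description\s*[:\n]\s*(.+?)(?:\n{2,}|\Z)", re.S)

def extract_description(buyers_post: str) -> str:
    """Return the Description text of a buyer's post, or '' if absent."""
    m = DESC_RE.search(buyers_post)
    return m.group(1).strip() if m else ""
```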
However, since many buyers describe their requirements (e.g., product name, product type) in a concise sentence, the labeling system filters simple sentences that do not contain intentions by the following criteria.

Part-of-Speech (POS) tag: Since an intentional sentence usually contains a verb that is not a form of "to be", we keep candidate sentences that contain terms tagged as verbs (e.g., VB and VBG), and remove sentences that contain only noun phrases (NN & NNS). For example, the second and fourth sentences shown in Fig. 4 are two useful sentences for training data, while the sentence, "It's the one with the 'C' on it," would be discarded.

The length of the sentence: Short polite words are usually used in the forums. Hence, we simply disregard sentences whose lengths are less than three words. For example, the first and the last sentences presented in Fig. 4 would be neglected.

All the candidate sentences that conform to the above rules are regarded as intentional data (i.e., positive instances). As for non-intentional training data (i.e., negative instances), we manually construct some queries (e.g., product names, people names, and proper nouns) that will collect entry pages from Wikipedia.5 We chose Wikipedia as our non-intentional training data source because it usually describes facts about a specific object and avoids individual subjective intentions and opinions (Zhang, Yu, & Meng, 2007).

3.3.2. Feature selection and feature value

For feature selection, we considered a subset of word unigrams chosen via the Pearson chi-square test (Chernoff & Lehmann, 1954). Yang and Pedersen (1997) suggest that the chi-square test is an effective approach to feature selection. To find out how dependent a feature f is with respect to the intention set or the non-intention set, we set up a null hypothesis that f is independent of the two categories with respect to its occurrences in the two sets.
A Pearson chi-square test compares observed frequencies of f to its expected frequencies to test this hypothesis according to a contingency table, as shown in Table 1. The N_ij in Table 1 is counted as the number of sentences containing (or not containing) f in the intentional (or non-intentional) dataset. The independence of f is tested by calculating its chi-square value

χ²(f) = Σ_{i∈{0,1}} Σ_{j∈{0,1}} (N_ij − E_ij)² / E_ij

where E_ij is the expected frequency of case ij, calculated by

E_ij = (row_i × column_j) / N,   i, j ∈ {0, 1}

A high chi-square value indicates that the hypothesis of independence, which implies that expected and observed counts are similar, is incorrect. In other words, the larger the chi-square value, the more class-dependent f is with respect to the intentional set or the non-intentional set. In this study, we selected the top-K features f with a high chi-square value as input features.

To apply these machine learning algorithms on our dataset, we used the standard bag-of-features framework. Let {f_1, ..., f_m} be a predefined set of m features that can appear in a document. Let w_i(d) be the weight of f_i as it occurs in document d. Then each document d is represented by the document vector d = (w_1(d), w_2(d), ..., w_m(d)). As for the weighting value, it can be assigned either a boolean value or a tf–idf (term frequency – inverse document frequency) value. Here we used tf–idf, which is a statistical measure that evaluates how important a word is to a document. The tf–idf function assumes that the more frequently a certain term t_i occurs in a document d_j, the more important it is for d_j, and furthermore, the more documents that term t_i occurs in, the smaller its contribution is in characterizing the semantics of a document in which it occurs. In addition, weights computed by tf–idf techniques are often normalized so as to counter the tendency of tf–idf to emphasize long documents.
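The chi-square score defined above can be computed directly from the four contingency counts of Table 1; the following is a minimal sketch (the function name and argument order are ours).

```python
# Chi-square score of a feature f from the Table 1 contingency counts:
# n11/n10 = sentences with/without f in the intentional set,
# n01/n00 = the same counts for the non-intentional set.
def chi_square(n11, n10, n01, n00):
    n = n11 + n10 + n01 + n00
    rows = (n11 + n10, n01 + n00)        # row marginals
    cols = (n11 + n01, n10 + n00)        # column marginals
    obs = ((n11, n10), (n01, n00))
    score = 0.0
    for i in (0, 1):
        for j in (0, 1):
            e = rows[i] * cols[j] / n    # expected frequency E_ij
            score += (obs[i][j] - e) ** 2 / e
    return score
```

A feature that occurs equally often in both sets scores 0 (observed counts match expected counts exactly); the top-K scorers are kept as input features.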
The type of tf–idf that we used to generate normalized weights for data representations in this study is

tf–idf = tf(t_i, d_j) × log(|D| / #D(t_i))

Table 1
Contingency table for chi-square.

                        f            ¬f           Row
Intentional set         N_11         N_10         N_11 + N_10
Non-intentional set     N_01         N_00         N_01 + N_00
Column                  N_11 + N_01  N_10 + N_00  N = N_11 + N_01 + N_10 + N_00

Hi, I am looking for a Coach style faceplate to fit a Motorola RazrV3xx phone. It's the one with the "C" on it. I am wanting the brown color if possible or the black. thanks!

Fig. 4. Example of a post with buyer's requirements.

4 http://pages.ebay.co.uk/wantitnow/.
5 http://en.wikipedia.org.
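Returning to the tf–idf formula above, a minimal sketch follows, treating documents as token lists. The variable names are ours, and |D| / #D(t_i) map to the corpus size and the number of documents containing t_i.

```python
import math

# tf-idf as in the formula above: tf(t_i, d_j) * log(|D| / #D(t_i)),
# where |D| is the corpus size and #D(t_i) the number of documents
# containing t_i. Documents are token lists; names are ours.
def tf_idf(term, doc_tokens, corpus):
    tf = doc_tokens.count(term)                 # raw term frequency
    df = sum(1 for d in corpus if term in d)    # document frequency
    return tf * math.log(len(corpus) / df) if df else 0.0
```

The length normalization mentioned in the text (e.g., dividing each document vector by its Euclidean norm) would then be applied on top of these raw weights.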