Blogger-Centric Contextual Advertising Teng-Kai Fan, Chia-Hui Chang * Department of Computer Science & Information Engineering, National Central University, No. 300, Jung-da Rd., Chung-li, Tao-yuan 320, Taiwan, ROC

Keywords: Online advertising, Text mining, Machine learning, Marketing, Information retrieval, Language model

Abstract: Web advertising (online advertising), a form of advertising that uses the World Wide Web to attract customers, has become one of the most commonly-used marketing channels. This paper addresses the concept of Blogger-Centric Contextual Advertising, which refers to the assignment of personal ads to any blog page, chosen according to bloggers' interests. As blogs become a platform for expressing personal opinions, they naturally contain various kinds of statements, including facts, comments and statements about personal interests, of both a positive and negative nature. To extend the concept behind the Long Tail theory in contextual advertising, we argue that web bloggers, as the constant visitors of their own blog-sites, could be potential consumers who will respond to ads on their own blogs. Hence, in this paper, we propose using text mining techniques to discover bloggers' immediate personal interests in order to improve online contextual advertising. The proposed Blogger-Centric Contextual Advertising (BCCA) framework aims to combine contextual advertising matching with text mining in order to select ads that are related to personal interests as revealed in a blog and rank them according to their relevance. We validate our approach experimentally using a set of data that includes both real ads and actual blog pages. The results indicate that our proposed method could effectively identify those ads that are positively-correlated with a blogger's personal interests. Crown Copyright 2010 Published by Elsevier Ltd. All rights reserved. 1.
Introduction Blogosphere is a collective term comprising all blogs and their interconnections. A blog, short for weblog, is a type of web site that is usually maintained by a blogger who will publish serial journal posts containing news, comments, opinions, diaries, and interesting articles. As of December 2007, the blog search engine Technorati1 was tracking more than 112 million blogs. Reports also indicate that about 1.2 million new blogs are being created worldwide each day. According to Technorati’s reports in April 2007, the number of blogs in the top 100 most popular sites has risen substantially. Hence, blogs continue to become more and more viable news and information outlets. Blogs are also an increasingly attractive platform for advertisers. The majority of bloggers have advertising on their blogs. Marketers realize that bloggers are creating high-quality content and attracting growing and loyal audiences (Technorati, 2008). Hence, it is common for blogs to feature advertisements that either financially benefit the blogger or promote the blogger’s favorite causes. Bloggers can be classified into three types (Technorati, 2008). Personal bloggers blog about topics on personal interests not associated with their work, professional bloggers mainly blog about their industries and professions but not in an official capacity for their companies; and corporate bloggers usually blog for their companies in an official capacity. Statistics show that four out of five bloggers (about 79%) are personal bloggers. The majority of bloggers have advertising or another method of revenue generation on their blogs. Among bloggers who have advertising on their blogs, two out of three have contextual ads and one-third have affiliate advertising on their blogs (Technorati, 2008). 
On average, professional and corporate bloggers are more likely to include search ads, display ads and affiliate marketing, because they certainly understand what kinds of ads are suitable for their blogs. However, the majority of personal bloggers, who have no specific idea which ads are proper to their web sites, rely on reliable matching mechanisms used in contextual advertising. Hence, in this paper we propose a contextual advertising mechanism that could increase click rates on personal blogs. Contextual advertising is based on studies that show that 80% of internet users are interested in receiving personalized content on sites they visit (ChoiceStream, 2005). Since the topic of a page somehow reflects the interest of visitors, ads delivered to visitors should depend upon page content rather than upon stereotypes created according to their geographical locations or upon other demographic features, such as gender or age (Kazienko & Adamski, 2007). As shown in previous studies, strong relevance increases the number of click-throughs (Chatterjee, Hoffman, & Novak, 2003; OneUpWeb, 2005; Wang, Zhang, Choi, & D'Eredita, 2002). Some studies (Fan & Chang, 2009; Zhang, Surendran, Platt, & Narasimhan, 2008) have also demonstrated that focusing on relevant topics 0957-4174/$ - see front matter Crown Copyright 2010 Published by Elsevier Ltd. All rights reserved. doi:10.1016/j.eswa.2010.07.105 * Corresponding author. E-mail address: chia@csie.ncu.edu.tw (C.-H. Chang). 1 http://Technorati.com. Expert Systems with Applications 38 (2011) 1777–1788
written with positive sentiment produces high click-through rates. Although a page-relevant topic is a way to capture visitors' interest, there is no other way to determine their personal interests. However, since bloggers are constant visitors to their own blogs, and their interests or intentions are well expressed in the weblogs, an ad agency could use those expressed intentions to place interest-oriented ads. For example, Fig. 1 shows a weblog with five ads related to traveling placed on the right. Since the content of this page describes reasons for cancelling a trip, these traveling ads are unlikely to be clicked. Instead, what the blogger needs is medical information or information on doctors. The point here is that an ad agency system could assign relevant ads according to bloggers' own interests, especially their immediate interests or intentions, for targeting ads, thus treating bloggers as the main visitors of their own blogs. To this end, even if an ad is related to the content of a linking page, an ad agency should preferentially consider the immediate interests expressed on the page (i.e., intentions) for placing ads. In this paper, we propose an ad matching mechanism, which we refer to as Blogger-Centric Contextual Advertising (BCCA), and which is based on latent interest detection to associate ads with blog pages. Instead of the traditional placement of relevant ads, BCCA emphasizes that the ad agency's system should provide relevant ads that are related to different levels of personal preferences in order to increase clicks. To evaluate our proposed method, we used a real-world collection comprised of ads and blog pages from Google AdSense and Google's Blog-Search Engine,2 respectively. Our results show that the proposed approach, based on text mining, can effectively recognize the latent interests (e.g., intentions) in a blog page, or the personal interests of the blogger.
In addition, we further investigated the effects of ad-page matching using an ad Click-Through-Rate (CTR) experiment, and our results suggest that our proposed method can effectively match relevant ads to a given blog page. The rest of this paper is organized as follows: Section 2 provides background information on current online advertising. Section 3 introduces our methodology. The experimental results are presented in Section 4. Section 5 outlines some related work. Finally, we present conclusions and future directions in Section 6. 2. Background There are two main categories of text-based advertising: sponsored search (or keyword targeted marketing) and contextual advertising (or content targeted advertising) (Anagnostopoulous et al., 2007; Broder, Fontoura, Josifovski, & Riedel, 2007). Sponsored search, which delivers ads to users based on the user's input query, can be used on sites with a search interface (e.g., search engines). Contextual advertising, on the other hand, is displayed on general web sites. These two techniques differ in that sponsored search analyzes only the user's query keywords, while content-based advertising parses the contents of a web page to decide which ads to show. However, the goals of each approach are consistent. The intent is to create a triple-win commercial platform. In other words, an advertiser pays a low price to purchase valuable advertisements, the ad agency system shares advertising profits with the web site owner (the publisher), and consumers can easily respond to ads to purchase products or services. Contextual advertising involves an interaction between four players (Anagnostopoulous et al., 2007; Broder et al., 2007). The publisher, or the owner of a web site, usually provides interesting pages on which ads are shown. The publishers typically aim to engage a viewer, encouraging them to stay on their web page and, furthermore, attracting sponsors to place their ads on the page.
The advertiser (the second player) supplies a series of ads to market or promote their products. The advertisers register certain characteristic keywords to describe their products or services. The ad agency system (the third player) is a mediator between the advertisers and the publishers; that is, it is in charge of matching ads to pages. The end user (the fourth player), who browses web pages, might interact with the ads to engage in commercial activities. In the pricing model of Cost Per Click (CPC), also known as Pay Per Click (PPC) (Feng, Bhargava, & Pennock, 2003), advertisers pay every time a user clicks on their ads. They do not actually pay for placing the ads, but instead they pay only when the ads are clicked. This approach allows advertisers to refine search keywords and gain information about their market. Generally, user clicks generate profits for both web site publishers and the ad agency system. A number of studies have suggested that strong relevance increases the number of ad clicks (Chatterjee et al., 2003; OneUpWeb, 2005; Wang et al., 2002). Hence, in this study, we similarly assume that the probability of a click for a given ad on a given page is determined by the ad’s relevance score with respect to the page. For simplicity, we ignore the positional effect of ad placement and pricing models, as in (Anagnostopoulous et al., 2007; Broder et al., 2007; Lacerda et al., 2006; Ribeiro-Neto, 2005). For many web 2.0 services, where publishers are responsible for content creation but do not host their own web sites, service providers sometimes play the role of ad agency. For example, many blog service or portal service providers (such as Facebook, MySpace, and Twitter) also have their own ad agency systems, which aim to generate profits while providing their services. 
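The CPC assumption above (click probability proxied by relevance, payment only on clicks) can be made concrete with a toy expected-revenue calculation. All numbers here are invented for illustration; the point is only that, per impression, the agency earns click probability times cost-per-click, so a relevant ad can out-earn a higher-bidding but irrelevant one.

```python
# Toy illustration of the Cost Per Click (CPC) model: expected revenue per
# impression is P(click) * cost-per-click. Click probabilities and bids below
# are invented for illustration only.

def expected_revenue(click_prob, cpc):
    """Expected payment per impression under CPC pricing."""
    return click_prob * cpc

# A relevant ad with a modest bid vs. a generic ad with a high bid.
ads = {"relevant_ad": (0.05, 0.30), "generic_ad": (0.01, 0.80)}
best = max(ads, key=lambda name: expected_revenue(*ads[name]))
# relevant_ad: 0.05 * 0.30 = 0.015 per impression
# generic_ad:  0.01 * 0.80 = 0.008 per impression
```

Under these (hypothetical) numbers the relevant ad is preferred despite its lower bid, which is the economic rationale for relevance-based ad matching.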
As we realize that blog owners are constant visitors of their own web pages, bringing personal ads to bloggers becomes promising, since bloggers' profiles, opinions, and short-term interests are expressed in their blogs. That is, the ad agency system of blog service providers could gain an advantage by providing the right message in the right context at the right time to bloggers. Note that this does not preclude targeting ads to visitors, but rather highlights a chance to target advertising to bloggers themselves. After all, advertising is about delivering the right message at the right time and in the right context to the right person (Adams, 2004; Kazienko & Adamski, 2007). To explore the possibility of targeting advertising to bloggers, we conducted a simple survey of 62 bloggers about their experience in clicking ads and what types of advertisements they prefer on their blogs. Among 50 participants who had experience clicking ads on their blogs, 40 of them indicated that they tend to click ads which are related to their interests and immediate requirements, while the other 10 participants randomly trigger ads without consideration of the correlation between ads and their
Fig. 1. Example of a blog page with correlated ads.
2 http://blogsearch.google.com.
interests. From this survey, we could say that 80% of bloggers tend to click the ads on their blogs and that about 80% of ad clicks are related to personal interests and requirements. Thus, in this paper, we propose the idea of user-centric contextual advertising and use the blogosphere as an example for realizing this idea. Compared with the traditional ad agency, which targets general visitors, blogger-centric advertising considers bloggers themselves as the ad targets and displays ads based on bloggers' interests and intentions, as described below. 3. BCCA framework Traditional contextual advertising processes a given page a user visits to find related topics for matching ads, while Blogger-Centric Contextual Advertising would assign ads to a given blog page in accordance with the blogger's interests. Before demonstrating the proposed framework, we will explain how bloggers' personal interests could be obtained. Generally speaking, an individual blog often contains profile information, tags and posts, which we could classify as indicating different levels of interest. 3.1. Profile-level Blog service providers (e.g., BlogSpot.com and Technorati.com) usually ask bloggers to enter interests to build their profiles (e.g., music, movies, reading, and other leisure pursuits) when they register as a member of the service. In addition to generic interests collected at registration time, the tags on posts or the archive of past posts can be used to construct bloggers' profiles, showing their specific interests. Since these kinds of interests continue for a period of time, we view them as long-term interests. In this paper, we assume that each blog-site has an interest profile containing either generic or specific long-term interests. 3.2. Triggering-level Blog posts are media in which bloggers express their opinions and interests, as well as their intentions.
For example, the sentence, "The Nokia N95 is a good cell phone for several reasons," expresses the sentiment of the author toward the object in question, a Nokia N95. The target is not necessarily a named entity (e.g., the name of a person, location, or organization) but it can also be a concept (such as a type of technology), a product name, or an event.3 Finding such targets is one of the key components for traditional contextual advertising. For Blogger-Centric Contextual Advertising, we argue that recognizing the intentions of authors could be even more effective. For instance, consider the sentence, "We're going to the doctor right now," in Fig. 1, which indicates that the author has an immediate intention to see a doctor. As another example, the sentence, "I am looking for a new laptop," implies that the author probably will purchase a laptop. As such, targets are immediate interests; ads centered around them might increase click-through. Another consideration for blogger-centric advertising is whether the sentence presents negative sentiments. For example, the phrase, "canceling trip to Europe," in Fig. 1 shows a negatively-connotated target, which has a lower priority. As demonstrated in (Fan & Chang, 2009), avoiding negative targets provides better contextual ads. Thus, we could say that their work is actually a special case of Blogger-Centric Contextual Advertising that aims at providing ads to the bloggers. In our Blogger-Centric Contextual Advertising framework (BCCA), the advertising system analyzes the content of the page to recognize intention and detect sentiment for triggering-level interests. If no such targets are found, the system uses targets from the blogger's profile and searches the ad database to find the best matching ads. These four modules (intention recognition, sentiment detection, term expansion and target-ad matching) are the main components in our BCCA framework.
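The selection priority embodied in these modules (positive intention first, then positive sentiment, then the profile) can be sketched as follows. This is a minimal illustration of the described flow, not the authors' code: the three callable parameters are hypothetical stand-ins for the framework's trained intention classifier, sentiment classifier, and target (noun) extractor.

```python
# Minimal sketch of the BCCA target-selection cascade. The callables
# detect_intention, detect_sentiment, and extract_nouns are hypothetical
# interfaces standing in for the framework's trained components.

def select_targets(post_sentences, profile_interests,
                   detect_intention, detect_sentiment, extract_nouns):
    """Pick ad targets by priority: positive intention, then positive
    sentiment (both triggering-level), then the blogger's profile."""
    # 1. Triggering-level: targets from intention-bearing sentences.
    targets = [t for s in post_sentences if detect_intention(s)
               for t in extract_nouns(s)]
    if targets:
        return targets
    # 2. Fallback: targets from sentences with positive sentiment.
    targets = [t for s in post_sentences if detect_sentiment(s) == "positive"
               for t in extract_nouns(s)]
    if targets:
        return targets
    # 3. Profile-level: long-term interests from the blogger's profile.
    return list(profile_interests)
```

For a post containing "I am looking for a new laptop," an intention detector would fire and "laptop" would become the target; for a post with no intention and no positive sentiment, the blogger's registered interests would be used instead.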
The first two components analyze triggering-level interests, while the last two enhance the target-ad matching procedure. The connections between modules are depicted in Fig. 2. Note that we place a priority on ad assignment based on the different levels of interest: triggering-level interests (i.e., short-term interests) have higher priority than profile-level interests (long-term interests). Thus, the sentiment detection module is invoked only when no positive intention is detected. If neither module detects intentional or positive sentences, the system falls back on targets from the blogger's profile. Next, the system proceeds to term expansion and target-ad matching. Because ads and targets are both short texts, we designed a term expansion component to enhance the likelihood of intersection between targets and available ads. Finally, a retrieval function based on a query likelihood language model is deployed as the target-ad matching strategy to rank the ads. The pseudo code for our ad assignment strategies is shown in Fig. 3.

3.3. Intention recognition

Given a triggering page, our aim in this section is to explain the process of recognizing whether there exist any intention-bearing sentences. By modeling this problem as one of classification, our job here is the preparation of training sentences, which must be labeled as intentional or non-intentional.

3.3.1. Collecting data for classifier training

Labeling each sentence as intentional or non-intentional is a time-consuming and costly task. In this study, we propose a novel

Fig. 2. The BCCA framework.

3 http://trec.nist.gov.
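As an illustrative sketch only (not the authors' code; the classifier callbacks and helper names are our own assumptions), the assignment priority described above might look like:

```python
def assign_targets(post_sentences, profile_terms,
                   detect_intention, detect_sentiment, extract_targets):
    """Pick ad targets by priority: positive intention > positive
    sentiment > profile-level (long-term) interests."""
    # 1) Triggering-level: sentences expressing positive intention.
    intentional = [s for s in post_sentences if detect_intention(s)]
    targets = extract_targets(intentional)
    if not targets:
        # 2) Triggering-level: sentences with positive sentiment.
        positive = [s for s in post_sentences
                    if detect_sentiment(s) == "positive"]
        targets = extract_targets(positive)
    if not targets:
        # 3) Profile-level fallback: long-term interests.
        targets = list(profile_terms)
    return targets  # next: term expansion, then target-ad matching
```

The `detect_*` callbacks stand in for the trained classifiers described in the intention recognition and sentiment detection sections.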
Fig. 3. Pseudo code of ad assignment strategies:

Algorithm: Ad Assignment Strategies
Input: a blog post P, a profile set PF, an ads collection A
Output: blogger-centric ads BCA that are related to personal interests
1. Positive Intention PI = IntentionRecognition(P)
2. Target(s) T = extract target(s) from PI // where target(s) are depicted as noun(s)
3. if T = ∅
4.   then Positive Sentiment PS = SentimentDetection(P)
5.   T = extract target(s) from PS
6.   if T = ∅
7.     T = target(s) from profile PF
8.   end if
9. T is expanded by TermExpansion()
10. Assign ads from A that are related to T
technique to semi-automatically label training data for this task. Since many auction web sites (e.g., ebay.com and yahoo.com) provide special forums for buyers to post their needs, such forums naturally become a source of intention-filled sentences. Fig. 4 shows an example of such a post. In this study, we collected a large set of posts containing buyers' requirements from ebay.com.4 For each buyer's post BP, we extracted the content of the Description field with a simple program coded mainly with regular expressions. However, since many buyers describe their requirements (e.g., product name, product type) in a concise sentence, the labeling system filters out simple sentences that do not contain intentions according to the following criteria.

Part-of-Speech (POS) tag: Since an intentional sentence usually contains a verb that is not a form of ''to be", we keep candidate sentences that contain terms tagged as verbs (e.g., VB and VBG), and remove sentences that contain only noun phrases (NN & NNS). For example, the second and fourth sentences shown in Fig. 4 are two useful sentences for training data, while the sentence, ''It's the one with the ''C" on it," would be discarded.

The length of the sentence: Short polite phrases are common in the forums. Hence, we simply disregard sentences whose lengths are less than three words. For example, the first and the last sentences presented in Fig. 4 would be neglected.

All the candidate sentences that conform to the above rules are regarded as intentional data (i.e., positive instances). As for non-intentional training data (i.e., negative instances), we manually constructed queries (e.g., product names, people names, and proper nouns) to collect entry pages from Wikipedia.5 We chose Wikipedia as our non-intentional training data source because it usually describes facts about a specific object and avoids individual subjective intentions and opinions (Zhang, Yu, & Meng, 2007).

3.3.2.
Feature selection and feature value

For feature selection, we considered a subset of word unigrams chosen via the Pearson chi-square test (Chernoff & Lehmann, 1954). Yang and Pedersen (1997) suggest that the chi-square test is an effective approach to feature selection. To find out how dependent a feature f is with respect to the intention set or the non-intention set, we set up a null hypothesis that f is independent of the two categories with respect to its occurrences in the two sets. A Pearson chi-square test compares the observed frequencies of f to its expected frequencies to test this hypothesis according to a contingency table, as shown in Table 1. N_ij in Table 1 counts the number of sentences containing (or not containing) f in the intentional (or non-intentional) dataset. The independence of f is tested by calculating its chi-square value

  χ²(f) = Σ_{i∈{0,1}} Σ_{j∈{0,1}} (N_ij − E_ij)² / E_ij

where E_ij is the expected frequency of case ij, calculated by

  E_ij = (row_i × column_j) / N,  i, j ∈ {0,1}

A high chi-square value indicates that the hypothesis of independence, which implies that expected and observed counts are similar, is incorrect. In other words, the larger the chi-square value, the more class-dependent f is with respect to the intentional set or the non-intentional set. In this study, we selected the top-K features with high chi-square values as input features.

To apply machine learning algorithms to our dataset, we used the standard bag-of-features framework. Let {f_1, ..., f_m} be a predefined set of m features that can appear in a document, and let w_i(d) be the weight of f_i as it occurs in document d. Then each document d is represented by the document vector d = (w_1(d), w_2(d), ..., w_m(d)). The weighting value can be assigned either as a boolean value or as a tf–idf (term frequency – inverse document frequency) value. Here we used tf–idf, a statistical measure that evaluates how important a word is to a document.
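The chi-square scoring just described can be sketched as follows (our own illustration; it assumes all marginal counts are nonzero):

```python
def chi_square(n11, n10, n01, n00):
    """Pearson chi-square for the 2x2 contingency table of Table 1.
    n11/n10: intentional sentences with/without feature f;
    n01/n00: non-intentional sentences with/without f."""
    n = n11 + n10 + n01 + n00
    rows = (n11 + n10, n01 + n00)          # row marginals
    cols = (n11 + n01, n10 + n00)          # column marginals
    obs = ((n11, n10), (n01, n00))
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            e_ij = rows[i] * cols[j] / n   # expected frequency E_ij
            chi2 += (obs[i][j] - e_ij) ** 2 / e_ij
    return chi2
```

Features are then ranked by this score and the top-K retained as classifier inputs.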
The tf–idf function assumes that the more frequently a certain term t_i occurs in a document d_j, the more important it is for d_j, and furthermore, the more documents that term t_i occurs in, the smaller its contribution is in characterizing the semantics of a document in which it occurs. In addition, weights computed by tf–idf techniques are often normalized so as to counter the tendency of tf–idf to emphasize long documents. The type of tf–idf that we used to generate normalized weights for data representations in this study is

  tf–idf = tf(t_i, d_j) · log( |D| / #D(t_i) )

Table 1
Contingency table for chi-square.

                       f           ¬f          Row
Intentional set        N11         N10         N11 + N10
Non-intentional set    N01         N00         N01 + N00
Column                 N11 + N01   N10 + N00   N = N11 + N01 + N10 + N00

Fig. 4. Example of a post with a buyer's requirements:
''Hi, I am looking for a Coach style faceplate to fit a Motorola RazrV3xx phone. It's the one with the "C" on it. I am wanting the brown color if possible or the black. thanks!"

4 http://pages.ebay.co.uk/wantitnow/.
5 http://en.wikipedia.org.
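As an illustrative sketch of this weighting scheme (our own toy code, combining the idf factor above with the 1 + log tf variant and cosine normalization described in the accompanying text):

```python
import math

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one cosine-normalized
    {term: weight} dict per document, using tf = 1 + log(count)
    and idf = log(|D| / #D(t))."""
    n_docs = len(docs)
    df = {}                                # document frequency #D(t)
    for doc in docs:
        for t in set(doc):
            df[t] = df.get(t, 0) + 1
    vectors = []
    for doc in docs:
        counts = {}
        for t in doc:
            counts[t] = counts.get(t, 0) + 1
        raw = {t: (1 + math.log(c)) * math.log(n_docs / df[t])
               for t, c in counts.items()}
        # cosine normalization; guard against an all-zero vector
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        vectors.append({t: w / norm for t, w in raw.items()})
    return vectors
```

Note that a term occurring in every document gets idf 0 and thus carries no weight, which is the intended behavior of the scheme.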
where the factor tf(t_i, d_j) is called the term frequency and the factor log( |D| / #D(t_i) ) is called the inverse document frequency; #D(t_i) denotes the number of documents in the document collection D in which term t_i occurs at least once, and

  tf(t_i, d_j) = 1 + log #(t_i, d_j)  if #(t_i, d_j) > 0, and 0 otherwise

where #(t_i, d_j) denotes the frequency of t_i in d_j. Weights obtained from the tf–idf function are then normalized by means of cosine normalization, finally yielding

  w_{i,j} = tfidf(t_i, d_j) / sqrt( Σ_{k=1}^{|T|} tfidf(t_k, d_j)² )

3.4. Sentiment detection

Our aim in this section is to apply a contextual sentiment technique for recognizing the sentiment of a blog page. Generally, researchers study opinions at three different levels: the word level, sentence level, and document level (Esuli & Sebastiani, 2006; Kim & Hovy, 2006; Ku, Ho, & Chen, 2009; Yu & Hatzivassiloglou, 2003). For this study, we need to identify whether a sentence is neutral, positive or negative. To build an efficient learning model, we divided the task of detecting the sentiment of a sentence into two steps, each of which is a binary classification problem. The first is an identification step that aims to identify whether the sentence is subjective or objective. The second is a classification step that classifies the subjective sentences as positive or negative.

3.4.1. Collecting data for classifier training

For sentiment classification, a good Web resource is epinions.com, where the pros and cons of products and other topics are discussed by reviewers. Such information is used by Kim and Hovy (2006) to prepare their training data. The pro and con fields in epinions.com contain comma-delimited phrases which describe the features of the products. Thus, Kim and Hovy use these two sets of pro and con phrases to label the orientation of sentences in the review document.
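The pro/con-based labeling can be sketched as follows (a simplification with toy modifier sets of our own; the refined variant described in this section restricts the cues to adjectives and adverbs via POS tags):

```python
def label_sentiment(sentence, pro_modifiers, con_modifiers):
    """Annotate a sentence as positive/negative/neutral from
    modifier cues harvested from pro and con fields. Pro cues are
    checked first, a simplification for sentences containing both."""
    words = set(sentence.lower().split())
    if words & pro_modifiers:
        return "positive"
    if words & con_modifiers:
        return "negative"
    return "neutral"
```

For example, with pro cues like "funny" or "fabulous", the phrase "a funny movie" is labeled positive.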
A sentence is annotated as positive if it contains pro phrases, as negative if it contains con phrases, and otherwise as neutral. They use these data and a learning algorithm with different feature categories (such as the unigram and the opinion-bearing word) to train a pro and con sentence recognition system. Although this method is fully automatic, such an idea based on pro and con phrases does not perform well (F-measure: 0.65). As indicated in Esuli and Sebastiani (2006), opinionated content is most often carried by parts of speech used as modifiers (i.e., adverbs and adjectives) rather than parts of speech used as heads (i.e., verbs, nouns), as exemplified by expressions such as a funny movie or a fabulous game. Thus, we combined the concept proposed in Esuli and Sebastiani (2006) and modified Kim and Hovy's (2006) method as follows. For each term tagged as an adjective or adverb (e.g., JJ and RB) in the pro and con sets, our labeling system checks each sentence to find sentences that contain those adjectives or adverbs. Then the system annotates each sentence with the appropriate sentiment label.

As for feature selection, we similarly adopted the chi-square statistic method to select the top-K features for both the sentiment identification and sentiment classification steps. For the identification step, all the subjective and objective sentences are represented as feature-presence vectors, where the presence or absence of each feature is recorded. For the classification step, all the positive and negative sentences are similarly represented as feature-presence vectors to train the classifiers.

3.4.2. Term expansion

In general, a blog page can be about any theme, while advertisements are concise in nature. Hence, the intersections of terms between ads and pages are very low.
If we only consider the existing terms included in a triggering page, an ad agency may not accurately retrieve relevant ads, even when an ad is related to a page. In this paper, we followed the three term expansion methods proposed in Fan and Chang (2009) to expand the specific terms in a blog page. According to Anagnostopoulos, Broder, Gabrilovich, Josifovski, and Riedel (2007) and Ribeiro-Neto, Cristo, Golgher, and Moura (2005), considering only the ads' abstracts and titles is not sufficient to perform page-ad matching. Thus, term expansion of the keywords in the triggering page, as well as in the ads, is conducted to increase overlap. For a triggering page, because not all the words included in a page are useful for carrying out term expansion, Fan and Chang (2009) simply take the terms tagged as nouns (NN & NNS) as candidate terms, from which they generate a set of seed terms according to the following rules:

T_Capitalization: whether a candidate term is capitalized, indicating a proper noun or an important word.
T_hypertext: whether a candidate term is part of the anchor text of a hypertext link.
T_title: whether a candidate term is part of the post's title.
T_frequency: consistent with term frequency, we considered the three most frequently occurring candidate terms as a subset of the seed terms.

Subsequently, the set of seed terms (SeedTerm = T_Capitalization ∪ T_hypertext ∪ T_title ∪ T_frequency) undergoes three term expansion methods. Two methods are dictionary-based operations that utilize the WordNet6 and Wikipedia7 thesauruses, respectively. The third method is a web-based search that identifies pages related to a triggering page to construct a co-occurrence list using the specific terms on the triggering page. For more details, readers are referred to Fan and Chang (2009).

3.4.3.
Page-ad matching

We can regard the Blogger-Centric Contextual Advertising issue as a traditional information retrieval problem: that is, given a user's query q, the IR (Information Retrieval) system returns relevant documents d according to the query content. Hence, we intuitively model a triggering page p and relevant ads a as a user's query q and corresponding documents d, respectively. In this paper, the query likelihood language model is adopted as our ad retrieval model. The language modeling approach to IR models the following idea: a document d is a good match to a query q if the document model is likely to generate q, which will in turn happen if the document contains the query words often (Ponte & Croft, 1998). Hence, we construct from each ad a in the collection a language model M_a. Our goal is to rank ads by P(a|q), where the probability of an ad a is interpreted as the likelihood that it is relevant to the query q. The ranking function P(a|q) can be converted by Bayes' rule to

  P(a|q) = P(a) P(q|a) / P(q)

Since P(q) has the same weight for all ads and the prior probability of an ad P(a) is usually treated as uniform across all a, both of them can be ignored.

6 http://wordnet.princeton.edu/.
7 http://en.wikipedia.org.
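The seed-term construction of Section 3.4.2 above can be sketched as follows (argument names are our own; the candidate nouns are assumed to be pre-extracted NN/NNS terms):

```python
from collections import Counter

def seed_terms(candidate_nouns, anchor_terms, title_terms, k=3):
    """Union of the four seed-term rules: capitalized candidates,
    candidates appearing in hypertext anchor text, candidates in
    the post title, and the k most frequent candidates."""
    t_cap = {t for t in candidate_nouns if t[0].isupper()}
    t_hyper = {t for t in candidate_nouns if t in anchor_terms}
    t_title = {t for t in candidate_nouns if t in title_terms}
    t_freq = {t for t, _ in Counter(candidate_nouns).most_common(k)}
    return t_cap | t_hyper | t_title | t_freq
```

The resulting set is what the dictionary-based and web-based expansion methods operate on.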
The language modeling approach attempts to model the query generation process: ads are ranked by the probability that a query q would be observed as a random sample from the respective ad model. Using the multinomial unigram language model, we get

  P(q|M_a) = ∏_{t∈q} P(t|M_a)^{tf(t,a)}

where tf(t,a) is the term frequency of the term t in advertisement a. The probability of producing the query given the language model M_a of advertisement a can be estimated by using maximum likelihood estimation (MLE) and the unigram assumption:

  P̂(q|M_a) = ∏_{t∈q} P̂_MLE(t|M_a) = ∏_{t∈q} tf(t,a) / N_a

where N_a is the length of a. Since terms generally appear very sparsely in advertisements, zero probabilities may cause problems in predicting the next word (i.e., P̂(q|M_a) = 0). Hence, we introduced a Bayesian smoothing mechanism using a Dirichlet prior (Zhai & Lafferty, 2004) in our ad language model M_a to discount nonzero probabilities and assign some likelihood mass to unseen words:

  p̂(t|M_a) = ( tf(t,a) + μ P̂_MLE(t|M_c) ) / ( N_a + μ )

where p̂(t|M_a) is the smoothed estimate of term t in ad a, and P̂_MLE(t|M_c) is the maximum likelihood estimate of term t in the entire ad collection. μ is the Dirichlet prior; we used a fixed value of μ = 2000 according to our experiments. Given the above language model, the retrieval ranking for a query q under the advertisement language model M_a for page-ad matching is given by

  P(a|q) ∝ ∏_{t∈q} ( tf(t,a) + μ P̂_MLE(t|M_c) ) / ( N_a + μ )

3.5. Ad content indexing

In Section 3.4.3, we discussed the scoring function for an ad given the triggering blog page. The top-k ads with the highest scores are assigned by the ad agency system. Score calculation and ad selection are done at retrieval time and must therefore be very efficient. We adopted a basic inverted index framework consisting of postings and a dictionary, with one postings list for each distinct term.
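The Dirichlet-smoothed ranking function above can be sketched as follows (our own illustration; query terms absent from the entire collection are skipped here to avoid a log of zero):

```python
import math
from collections import Counter

def rank_ads(query_terms, ads, mu=2000.0):
    """Rank ads (token lists) by Dirichlet-smoothed query likelihood:
    score(a) = sum over t in q of
               log( (tf(t,a) + mu * P_MLE(t|Mc)) / (N_a + mu) ).
    Returns ad indices, best first."""
    collection = Counter()                    # collection model Mc
    for ad in ads:
        collection.update(ad)
    coll_len = sum(collection.values())
    scores = []
    for ad in ads:
        tf = Counter(ad)
        n_a = len(ad)                         # ad length N_a
        score = 0.0
        for t in query_terms:
            if collection[t] == 0:
                continue                      # unseen in whole collection
            p_c = collection[t] / coll_len    # P_MLE(t | Mc)
            score += math.log((tf[t] + mu * p_c) / (n_a + mu))
        scores.append(score)
    return sorted(range(len(ads)), key=scores.__getitem__, reverse=True)
```

Log-space summation replaces the product for numerical stability; the ranking is unchanged since log is monotone.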
The ad contents are tokenized into a list of terms via linguistic preprocessing, including stemming and stop-word removal, to produce a list of normalized tokens called indexing terms. For each indexing term, we keep a list that records which ads the term occurs in, together with its associated weight (tf·idf value). This list is called a postings list; it contains one entry per indexing term/ad combination.

4. Experimental results

In this section, we focus on our three experiments: intention recognition, sentiment detection and page-ad matching. We begin by describing the datasets and text preprocessing, and then proceed to a discussion of the experimental results.

4.1. Datasets and text-preprocessing

To evaluate intention recognition performance, we retrieved data from ebay.com and Wikipedia and used these as the training dataset for building our learning classifier model. In this study, we collected many different categories of buyers' posts (e.g., antiques, books, sports, home, health, music, photography, and travel) and general entry pages of Wikipedia, respectively. In detail, we first collected 36,151 buyers' posts and then randomly examined 10,000 sentences according to the rules discussed in Section 3.3.1. For non-intentional training data, we similarly examined 10,000 sentences from the Wikipedia dataset. Additionally, to evaluate sentiment detection performance, we collected data from epinions.com and used this as our training dataset for building learning classifier models. We gathered many different types of reviews from epinions.com (e.g., 3C products, hotels, movies, travel, theme parks, and second-hand cars). In detail, we examined about 30,000 reviews (about 900,000 sentences), with an average of 29 sentences per review document.
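The postings structure of Section 3.5 can be sketched as follows (a toy illustration; raw term frequency stands in for the stored tf·idf weight):

```python
def build_inverted_index(ads):
    """Map each indexing term to a postings list of (ad_id, weight)
    entries, one entry per term/ad combination."""
    index = {}
    for ad_id, terms in enumerate(ads):
        counts = {}
        for t in terms:                      # per-ad term frequencies
            counts[t] = counts.get(t, 0) + 1
        for t, c in counts.items():
            index.setdefault(t, []).append((ad_id, c))
    return index
```

At retrieval time, only the postings lists of the (expanded) target terms need to be scanned, which keeps ad selection fast.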
For page-ad matching, because of the lack of a large-scale ad database, we first chose certain general topic words (e.g., alcohol, book, clothes, cosmetics, culture, game, laptop, medicine, mobile phone, and sport) as query terms to request web pages from search engines such as Google and Yahoo!. About 10,000 pages were retrieved, and we placed these pages on an ad-crawler platform to obtain the corresponding ads assigned by Google AdSense. Our ad-crawler was similar to a generic blog web site in which a JavaScript module (e.g., Google AdSense) can be embedded. We first extracted the content of each retrieved page (about 10,000 pages) and then treated this content as a blog post in order to get the corresponding ads assigned by the Google AdSense JavaScript module. Then, we extracted the ads with a simple program coded mainly with regular expressions. In addition to our ad-crawler system, we also used an existing ad recommendation system8 to increase the size of our ad collection. These two ad-collecting methods are reasonable ways to collect real-world advertisements; we collected a total of 138,907 ads.

Our triggering page collection comprised 200 blog pages on various topics. It included a range of opinions: various subjective articles (100 positive and 50 negative articles) and 50 general articles including intentional sentences. We selected opinion-triggering pages according to the ratio of positive and negative sentences: that is, if the ratio of positive to negative sentences was over 4:1, we regarded a triggering page as positive, and vice versa. For intentional triggering pages, we manually selected articles that contained at least one intentional sentence. In addition to triggering pages, we manually constructed user profiles for each triggering page. Since not all of the blog web sites provided profile information, the tags of archives were substituted for user profiles when user profiles were unavailable.
To acquire POS tags, we adopted the GENIA Tagger developed by the University of Tokyo.9 In addition, we preprocessed the full text of each triggering page with its expanded terms (as discussed in Section 3.3), and the ads (including the full text of the landing page and its abstract) with their expanded terms, by removing stop words and one-character words, followed by stemming (Porter, 1980).

8 http://www.labnol.org/google-adsense-sandbox/.
9 http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/tagger/.

1782 T.-K. Fan, C.-H. Chang / Expert Systems with Applications 38 (2011) 1777–1788

4.2. Evaluation of intention recognition

We experimented with three standard algorithms: Naive Bayes classification, a decision tree, and a support vector machine (SVM). Due to space limitations, we only present the best performance, generated by the SVM classifier, in this paper. The goals in this section are to explore how well the intention recognition model performed with different approaches on the data collected from ebay.com and Wikipedia, and to investigate how well the trained
model performed on a different data source (50 triggering pages including intentional sentences).

For the ebay.com and Wikipedia data, we adopted a fivefold cross-validation mechanism for the learning algorithm. The precision (proportion of identified sentences that are marked as intentional or non-intentional sentences, out of all the identified sentences), recall (proportion of the marked intentional or non-intentional sentences that are identified, out of all marked intentional or non-intentional sentences available) and F-measure (weighted harmonic mean of precision and recall) were used as evaluation measures. Besides, we regarded the SVM with need-bearing words10 as our baseline system. Table 2 shows intention recognition results generated by the SVM classifier with different feature sets. As can be seen in this table, the best F-measure for the non-intentional class (94.0%) and the best F-measure for the intentional class (92.5%) appeared using the SVM with need-bearing words and unigrams, respectively. Moreover, for the non-intentional class, our results also suggest that there are no significant differences in terms of precision, recall and F-measure among an SVM that used unigrams, features selected by chi-square, and need-bearing words. As for the intentional class, using unigrams and chi-square can outperform the baseline system by about 20% in terms of F-measure.

According to the above results, for the dataset of triggering pages (i.e., 50 articles including intentional sentences), owing to a lack of training data, we chose learning algorithms with different feature sets to train the models on the ebay.com and Wikipedia datasets. We then applied these models to our triggering-page dataset. In order to obtain a reasonable evaluation, human experts annotated the entire triggering-page set. The resulting numbers of intentional and non-intentional labeled sentences were 552 and 138, respectively.
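The three evaluation measures defined above reduce to simple ratio arithmetic over counts. The following helper is a straightforward rendering of those definitions (the function name and argument names are illustrative):

```python
# Precision, recall, and the balanced harmonic-mean F-measure for one class,
# computed from raw counts as defined in the text.
def precision_recall_f(identified, correct, relevant):
    """identified: sentences the classifier marked for the class;
    correct: those among them that truly belong to the class;
    relevant: all sentences truly in the class."""
    p = correct / identified if identified else 0.0
    r = correct / relevant if relevant else 0.0
    f = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f
```

For instance, a classifier that marks 100 sentences as intentional, 90 of them correctly, out of 120 truly intentional sentences, scores precision 0.90, recall 0.75, and F-measure about 0.82.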
We conducted the identical intention recognition step, and the results are shown in Table 3. The SVM with features selected by chi-square gave the best results, yielding an average 94.2% F-measure for the non-intentional class and an average 80.5% F-measure for the intentional class. For the non-intentional class, there are no differences among the three feature sets. However, for the intentional class, using unigrams and chi-square can outperform the SVM baseline system by about 10% in terms of F-measure. As can be seen in Tables 3 and 4, for the non-intentional class, there are no significant differences. However, for the intentional class, there are significant differences of about 10% between ebay.com and the triggering pages in terms of precision, recall, and F-measure. It seems reasonable to interpret this as follows: the data from ebay.com mainly describe the product purchasing intention in a concise sentence and do not provide reasons why the buyer needs the product (e.g., "I want to buy a phone because..."). In contrast, our triggering pages, selected from individual blogs, not only contain personal purchasing intentions but also provide detailed reasons why the bloggers need a product.

According to Table 3, the best results were generated by the classifier with chi-square features (10,000); thus, we performed a series of experiments to investigate the effect of different feature sizes, as shown in Fig. 5. As can be seen in this figure, using 100 features produces a better performance, achieving 95.0% and 83.2% for the non-intentional and intentional classes, respectively.

4.3. Evaluation of sentiment detection

We experimented with two standard algorithms: a decision tree and a support vector machine (SVM). Due to space limitations, we only present the best performance, generated by the SVM classifier, in this paper. The goals of this section are similar to (Kim & Hovy,

Table 2
Intention and non-intention sentence discovery results.
Feature set (# of features)   Class             Precision (%)   Recall (%)   F-measure (%)
Unigram (45,030)              Non-intentional   93.5            91.3         92.4
                              Intentional       91.5            93.6         92.5
Chi-square (10,000)           Non-intentional   91.1            94.0         92.5
                              Intentional       93.8            90.8         92.4
Need-bearing words (77)       Non-intentional   90.0            98.4         94.0
                              Intentional       89.7            56.5         69.3

Table 3
Intention recognition by various feature sets.
Feature set (# of features)   Class             Precision (%)   Recall (%)   F-measure (%)
Unigram (45,030)              Non-intentional   94.5            93.5         94.0
                              Intentional       77.8            80.9         79.4
Chi-square (10,000)           Non-intentional   95.5            92.9         94.2
                              Intentional       77.1            84.2         80.5
Need-bearing words (77)       Non-intentional   89.7            94.1         91.8
                              Intentional       74.6            61.8         67.6

Table 4
Subjective and objective sentence identification results.
Feature set (# of features)      Class        Precision (%)   Recall (%)   F-measure (%)
Unigram (40,626)                 Objective     85.5            84.4         86.9
                                 Subjective    88.0            85.0         86.5
Chi-square (1000)                Objective     79.6            93.1         85.8
                                 Subjective    91.7            76.2         83.2
Chi-square (3000)                Objective     82.5            92.4         87.1
                                 Subjective    91.3            80.4         85.5
Chi-square (5000)                Objective     83.8            91.5         87.5
                                 Subjective    90.7            82.3         86.3
Opinion-bearing words (20,508)   Objective     80.4            86.9         83.6
                                 Subjective    85.8            78.9         82.2
Baseline                         Objective     77.9            67.9         72.6
                                 Subjective    50.2            62.6         55.6

10 http://www.wjh.harvard.edu/inquirer/Need.html.
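Several of the feature sets compared in Tables 2-4 (e.g., chi-square (10,000)) are produced by chi-square feature selection. A hedged sketch for a two-class problem follows; the 2x2 contingency-table formulation is the standard one, but the function names and the toy counts are illustrative, not taken from the paper.

```python
# Hypothetical sketch of chi-square feature scoring: each term is ranked by
# the chi-square statistic of its 2x2 term/class contingency table, and the
# top-k terms are kept as features.
def chi_square(a, b, c, d):
    """a: class-1 docs containing the term, b: class-2 docs containing it,
    c: class-1 docs without it, d: class-2 docs without it."""
    n = a + b + c + d
    num = n * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den if den else 0.0

def top_k_terms(term_counts, k):
    """term_counts: term -> (a, b, c, d). Returns the k highest-scoring terms."""
    scored = sorted(term_counts, key=lambda t: chi_square(*term_counts[t]), reverse=True)
    return scored[:k]
```

A term perfectly correlated with one class gets a high score, while a term distributed evenly across both classes scores zero and is dropped.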
2006): the first is to explore how well the positive and negative detection model performed with different approaches on the data collected from epinions.com; and the second is to investigate how well the trained model performed on a different data source (200 triggering pages). Fan and Chang (2009) implemented the automatic labeling approach proposed by Kim and Hovy (2006). The results in (Fan & Chang, 2009) indicate that there are no significant differences between (Kim & Hovy, 2006) and (Fan & Chang, 2009), and show that the automatic labeling method can effectively generate a training dataset for sentiment detection. In this paper, since we similarly did not restrict the topic selections of epinions.com and adopted an SVM algorithm like (Fan & Chang, 2009), we compared the modified method (as discussed in Section 3.2) to the one (regarded as the baseline system) used in (Fan & Chang, 2009). For the epinions.com data, we adopted a fivefold cross-validation mechanism for the learning algorithm. The precision (proportion of identified sentences that are marked as subjective or objective sentences, out of all the identified sentences), recall (proportion of the marked subjective or objective sentences that are identified, out of all marked subjective and objective sentences available) and F-measure (weighted harmonic mean of precision and recall) were used as evaluation measures. Table 4 shows subjective and objective sentence identification results generated by an SVM classifier with different feature sets. The opinion-bearing words are pre-selected from (Esuli & Sebastiani, 2006) according to their opinion-related scores. As can be seen in this table, the best F-measure for the objective class (87.5%) and the best F-measure for the subjective class (86.5%) appeared using the SVM with chi-square (5000) and unigram features, respectively.
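The fivefold cross-validation protocol used in these experiments can be sketched as follows. The split logic is illustrative (shuffle once, rotate each fold through the test role), not the authors' code.

```python
# Minimal sketch of fivefold cross-validation: the labeled data are shuffled
# once and split into five folds; each fold serves as the test set exactly
# once while the remaining four folds form the training set.
import random

def five_fold_splits(items, seed=0):
    items = list(items)
    random.Random(seed).shuffle(items)
    folds = [items[i::5] for i in range(5)]
    for i in range(5):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test
```

Averaging the per-fold precision, recall, and F-measure then gives the reported cross-validated scores.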
Our results also suggest that there are no significant differences in terms of precision, recall and F-measure among an SVM that used unigrams, features selected by chi-square, and opinion-bearing words. Moreover, our proposed modified automatic labeling method can outperform the baseline by about 10% and 30% in terms of F-measure for the objective and subjective classes, respectively.

For the sentiment classification experiment, we used the results of sentiment identification as input. The goal of this experiment was to classify a subjective sentence into a suitable class (i.e., positive or negative). The precision (proportion of classified sentences that are marked as positive (or negative) sentences, out of all the classified sentences), recall (proportion of the marked positive (or negative) subjective sentences that are classified, out of all the marked positive (or negative) sentences available) and F-measure (weighted harmonic mean of precision and recall) are similarly used as the evaluation metrics. Table 5 shows results for the sentiment classification experiment. The results clearly show that the best F-measure for the positive class (85.9%) and the best F-measure for the negative class (86.2%) are produced by the SVM with chi-square (10,000) features. As can be seen in this table, the SVM with features selected by chi-square can outperform the SVM with unigrams and with opinion-bearing words. Moreover, our proposed modified automatic labeling approach can outperform the baseline by about 30% and 25% in terms of F-measure for the negative and positive classes, respectively.

For the dataset of triggering pages, owing to a lack of training data and according to the above results, we subsequently chose learning algorithms with different feature sets to train the models on the epinions.com dataset. We then applied these models to our triggering-page dataset. In order to get a reasonable evaluation, human experts annotated the entire triggering-page set.
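The two-stage cascade described above (identification feeding classification) can be sketched as a small driver. The two classifier callables are placeholders standing in for the trained SVM models; the function and label names are assumptions for illustration.

```python
# Hedged sketch of the sentiment detection cascade: sentences are first
# separated into objective/subjective (identification), and only subjective
# sentences are then classified as positive or negative (classification).
def detect_sentiment(sentences, identify, classify):
    """identify(s) -> 'objective' | 'subjective';
    classify(s) -> 'positive' | 'negative' (placeholder models)."""
    results = {}
    for s in sentences:
        if identify(s) == "subjective":
            results[s] = classify(s)
        else:
            results[s] = "objective"
    return results
```

This mirrors the evaluation design: identification errors propagate into the classification stage, which is why the classification metrics are computed on the identification output.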
The resulting numbers of objective and subjective labeled sentences were 4673 and 1434, respectively. The subjective sentences comprised 976 positive and 458 negative sentences. We conducted the identical sentiment identification and classification steps, and the results are shown in Table 6. For the sentiment identification task, the SVM with features selected by chi-square gave the best results, yielding an average 87.1% F-measure for the objective class and an average 69.9% F-measure for the subjective class. For the sentiment classification task, the best F-measure for the positive class (74.8%) and the best F-measure for the negative class (59.1%) are generated by the SVM with features selected by chi-square and the SVM with unigrams, respectively. Moreover, our proposed modified automatic labeling approach can outperform the baseline by about 15% and 5% in terms of F-measure for the sentiment identification and classification tasks, respectively. As can be seen in our results, there are significant differences of about 10-15% between epinions.com and the triggering pages in terms of precision, recall, and F-measure. It seems reasonable to infer that the data from epinions.com mainly focus on product reviews and lack emotional descriptions (e.g., of sadness, fear, or surprise). However, our triggering pages, selected from individual blog posts, not only contain product reviews but also cover individual emotions.

4.4. Evaluation of page-ad matching

The goal of this section is to investigate to what extent ad placements are actually related to the personal interests evident on the triggering pages.
[Fig. 5. Effect of feature size on accuracy: F-measure versus the number of features selected, for the non-intentional and intentional classes.]

Table 5
Positive and negative sentence classification results.
Feature set (# of features)    Class      Precision (%)   Recall (%)   F-measure (%)
Unigram (32,644)               Negative    80.8            81.4         81.1
                               Positive    81.3            80.7         81.0
Chi-square (1000)              Negative    80.5            83.3         81.9
                               Positive    82.7            79.8         81.2
Chi-square (5000)              Negative    84.4            85.9         85.1
                               Positive    85.6            84.1         84.9
Chi-square (10,000)            Negative    85.5            86.8         86.2
                               Positive    86.6            85.3         85.9
Opinion-bearing words (3105)   Negative    64.3            83.2         72.5
                               Positive    76.2            53.8         63.1
Baseline                       Negative    51.4            50.1         51.1
                               Positive    52.6            61.4         56.6

To evaluate our page-ad matching framework, we compared the top-five ranked ads provided by three different ranking methods: our proposed approach with a personalization mechanism (i.e., Blogger-Centric Contextual Advertising (BCCA)), our proposed approach without a personalization mechanism (i.e., Plain Contextual Advertising (PCA), excluding intention recognition and sentiment detection), and Google AdSense. Here we used the SVM with features selected by chi-square as the intention
and sentiment models, according to the experimental results. No more than 15 ads were retrieved and inserted into a pool for each triggering page. Since it is difficult to invite the original authors of blogs to participate in an ad click-through-rate (CTR) experiment, 14 volunteers participated in our CTR experiment. All the advertisements in each pool were manually clicked by volunteers. We published each triggering page and its ads from the corresponding pool on a testing platform. In order to provide a fair testing environment, we ignored the effect of ad position order and randomly placed the relevant ads on a given triggering page. After reading the content of a given post, the participants were regarded as the blog publisher, who then clicked the ads according to personal interests. To compare with Google AdSense, we only measured the CTR for this experiment. The measure is the fraction of retrieved ads that are clicked. In order to investigate the effectiveness of our proposed method on different datasets, we further selected a positive sentiment dataset (100 documents) and an intentional dataset (50 documents) from our triggering pages. Table 7 shows the results for the three page-ad matching methods across various types of datasets. In the case of the positive sentiment dataset, there are no significant differences among the three page-ad matching approaches. Our BCCA framework and PCA respectively produce 52% and 53% in terms of precision, while Google AdSense achieves about 47% accuracy. For the intention dataset, our results show that the proposed BCCA approach yields a better performance (58%) than the other approaches (32% for the PCA approach and 42% for Google AdSense). Regarding all triggering pages, our results show that the proposed BCCA approach yields a better performance (57%) than the other approaches (42% for the PCA approach and 40% for Google AdSense).
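The CTR measure above (the fraction of retrieved ads that are clicked) is straightforward to compute; a minimal illustrative helper, with hypothetical names:

```python
# The click-through-rate measure used above: the fraction of retrieved
# (placed) ads that volunteers actually clicked.
def click_through_rate(clicked_ads, retrieved_ads):
    if not retrieved_ads:
        return 0.0
    return len(set(clicked_ads)) / len(retrieved_ads)
```

For a pool of 15 retrieved ads of which volunteers click 9, the CTR is 0.6; the per-dataset figures in Table 7 are averages of such per-page values.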
According to Table 7, these results lead us to the conclusion that our BCCA framework can place ads that are related to the personal-interest content of triggering pages. Although the results generated by our proposed method are better than Google's, in this paper we do not emphasize this conclusion, for two reasons. One reason is that Google AdSense needs to select the recommended ads from an ad pool that is vastly larger than the one used by us. Another plausible reason has to do with ad categories: that is, Google AdSense considers more ad categories than the categories we adopted.

To investigate the generalizability of our proposed framework, the goal of our next experiment was to explore our ad assignment strategies (i.e., intention recognition, sentiment detection and term expansion) as applied to different information retrieval models. We compared the language model (LM) with two other well-known IR algorithms: Okapi BM25 (Robertson, Walker, Jones, Hancock-Beaulieu, & Gatford, 1994) and tf*idf (Salton & Buckley, 1988). We similarly selected the top-five ranked ads provided by these IR algorithms. We used all triggering pages as our dataset, again ensuring that no more than 15 ads would be retrieved and inserted into a pool for each triggering page. All the advertisements in each pool were manually judged by experts. The experts mainly evaluated each page-ad pair according to two principles: correlation and intention. The correlation principle concerns whether an ad is positively related to the blogger's interests as revealed by a given blog page. The intention principle concerns whether the experts have any intention to click an ad. An ad judged as gold-standard has to comply with both the correlation and intention principles. The experts were divided into two groups that judged the gold-standard independently; the average pairwise agreement measure and Kappa coefficient value between the two teams reached 0.93 and 0.85, respectively.
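The two agreement statistics reported above can be computed as follows. Cohen's kappa is one standard form of the kappa coefficient (the paper does not say which variant it used, so this is an assumption); the function and label names are illustrative.

```python
# Hedged sketch of the two agreement statistics for two annotator groups:
# raw pairwise agreement, and Cohen's kappa, which corrects that agreement
# for chance using each annotator's label distribution.
from collections import Counter

def agreement_and_kappa(labels_a, labels_b):
    n = len(labels_a)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    ca, cb = Counter(labels_a), Counter(labels_b)
    expected = sum(ca[l] * cb[l] for l in set(labels_a) | set(labels_b)) / (n * n)
    if expected == 1:
        return observed, 1.0
    return observed, (observed - expected) / (1 - expected)
```

Values of 0.93 (agreement) and 0.85 (kappa), as reported above, indicate very consistent judgments between the two expert groups.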
The average number of relevant advertisements was 8 per triggering page. The results of our proposed page-ad matching with different IR approaches are shown in Table 8. As can be seen in Table 8, there are no significant differences in terms of accuracy among the three IR approaches. For all triggering pages, using the LM approach yields better performance (64%) than the other approaches (62% for Okapi BM25 and 60% for tf*idf). These results lead us to conclude that our ad assignment strategies can actually be employed

Table 7
Accuracy of page-ad matching.
Dataset                Method           CTR (%)
Positive dataset       BCCA             52
                       PCA              53
                       Google AdSense   47
Negative dataset       BCCA             65
                       PCA              30
                       Google AdSense   33
Intention dataset      BCCA             58
                       PCA              32
                       Google AdSense   42
All triggering pages   BCCA             57
                       PCA              42
                       Google AdSense   40

Table 6
System results for triggering pages.
Feature set             Task             Class        Precision (%)   Recall (%)   F-measure (%)
Unigram                 Identification   Objective     81.1            88.9         84.8
                                         Subjective    75.8            62.8         67.7
                        Classification   Negative      80.6            46.6         59.1
                                         Positive      70.0            65.0         67.4
Chi-square              Identification   Objective     80.5            95.0         87.1
                                         Subjective    86.6            58.8         69.9
                        Classification   Negative      63.8            52.6         57.6
                                         Positive      96.0            61.3         74.8
Opinion-bearing words   Identification   Objective     73.7            88.3         80.3
                                         Subjective    67.2            43.3         52.7
                        Classification   Negative      46.1            42.0         44.0
                                         Positive      89.1            44.0         58.9
Baseline                Identification   Objective     78.5            64.5         70.8
                                         Subjective    50.1            67.7         58.2
                        Classification   Negative      56.1            53.1         54.6
                                         Positive      56.2            71.1         68.2
in any IR model for placing blogger-centric ads. In addition to accuracy, we also used the precision-recall curve, shown in Fig. 6, to evaluate our approach. Each data point corresponds to the precision value calculated at a certain recall level. The results clearly indicate that the language model achieves better performance than Okapi BM25 and tf*idf. We also adopted two further quality measures, Precision@K and mean average precision (MAP), to assess the matching results. The average retrieval precision computed at recall level K is:

$$\mathrm{Precision}@K = \frac{\sum_{i=1}^{N_q} P_i@K}{N_q}$$

where Precision@K is the average precision at recall level K, $N_q$ is the number of queries used, and $P_i@K$ is the precision at recall level K for the ith query. To compare the precision-recall curves across the three page-ad matching functions, we computed MAP. For a single query, average precision is the mean of the precision values obtained after each relevant document is retrieved; this value is then averaged over all queries. That is, if the set of relevant documents for a query $q_j \in Q$ is $\{d_1, d_2, \ldots, d_{m_j}\}$ and $R_{jk}$ is the set of ranked retrieval results from the top result down to document $d_k$, then

$$\mathrm{MAP}(Q) = \frac{1}{|Q|} \sum_{j=1}^{|Q|} \frac{1}{m_j} \sum_{k=1}^{m_j} \mathrm{Precision}(R_{jk})$$

where $Q = \{q_1, q_2, \ldots, q_m\}$ is the set of queries. Since we have three page-ad matching models, we computed the MAP score for three query sets. The results are displayed in Fig. 7. The Okapi BM25 and tf*idf models yield MAP scores of around 41% and 32%, respectively, while the language model improves this to around 43%. As these figures show, the results based on the language model are consistently superior to those of Okapi BM25 and tf*idf.
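The Precision@K and MAP measures defined above can be sketched in a few lines of code; the ranked ad lists and relevance sets below are hypothetical illustrations, not the paper's data.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked items that are relevant."""
    return sum(item in relevant for item in ranked[:k]) / k

def average_precision(ranked, relevant):
    """Mean of the precision values taken at each relevant item's rank."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """MAP over a query set: runs is a list of (ranked, relevant) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Hypothetical ranked ad lists and relevant-ad sets for two triggering pages.
runs = [(["a1", "a2", "a3", "a4"], {"a1", "a3"}),
        (["b2", "b1", "b3"], {"b1"})]
print(mean_average_precision(runs))  # about 0.667
```

Each triggering page plays the role of a query, and the ads judged as gold standard form its relevant set.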
5. Related work

Several prior studies are relevant to our work, including efforts in online advertising and sentiment classification. Many personalized advertising methods have been proposed that make use of explicit user profiles, which are gathered, maintained, and analyzed by the ad-placing system. Such methods often make use of data-mining techniques (Lai & Yang, 2000; Perner & Fiss, 2002). Many web portals create user profiles using information gained during the registration process; however, out of concern for privacy, users tend to give incorrect data. In addition to user profiles, an alternative solution is to exploit information stored in web server logs (Bae, Park, & Ha, 2003). Several studies in advertising research have stressed the importance of relevant associations for consumers (Langheinrich, Nakamura, Abe, Kamba, & Koseki, 1999; Wang et al., 2002), noting that irrelevant ads can turn users off while relevant ads are more likely to be clicked (Chatterjee et al., 2003; Parsons, Gallagher, & Foster, 2000). These studies show that advertisements presented to uninterested users can result in customer annoyance; thus, to be effective, advertisements should be relevant to a consumer's interests at the time of exposure. Novak and Hoffman (1997) reinforce this conclusion by pointing out that the more targeted the advertising, the more effective it is. As a result, several studies have tried to determine how to exploit the available evidence to enhance the relevance of selected ads. For example, studies on keyword matching show that the nature and number of keywords affect the likelihood of an ad being clicked (OneUpWeb, 2005). As for contextual advertising, Ribeiro-Neto et al. (2005) proposed a number of strategies for matching pages to ads based on extracted keywords.
The first five strategies proposed in that work match pages and ads based on the cosine of the angle between their respective vectors. To identify the important parts of the ad, the authors explored the use of different ad sections (e.g., bid phrase, title, and body) as the basis for the ad vector. The winning strategy required the bid phrase to appear on the page, and then ranked all such ads using the cosine of the union of all the ad sections and the page vectors. Although both pages and ads are mapped to the same space, there exists a discrepancy (called "impedance mismatch") between the vocabulary used in the ads and on the pages. Hence, the authors achieved improved matching

Table 8
Accuracy of page-ad matching.

IR method        Dataset                Accuracy (%)
Language model   Positive dataset       57
                 Negative dataset       79
                 Intention dataset      57
                 All triggering pages   64
Okapi BM25       Positive dataset       52
                 Negative dataset       80
                 Intention dataset      50
                 All triggering pages   62
TF*IDF           Positive dataset       50
                 Negative dataset       82
                 Intention dataset      52
                 All triggering pages   60

Fig. 6. Precision-recall curve.

Fig. 7. The performance of the three IR models.
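The cosine-based page-ad matching behind these strategies can be illustrated with a minimal sketch. The page and ad term vectors below are hypothetical; each ad is represented, as in the winning strategy, by the union of its bid phrase, title, and body terms.

```python
import math
from collections import Counter

def cosine(vec_a, vec_b):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(w * vec_b.get(t, 0.0) for t, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical term-frequency vectors; real systems would weight terms
# with tf*idf rather than raw counts.
page = Counter("cheap flights to taipei cheap hotels".split())
ads = {
    "ad1": Counter("cheap flights taipei airline deals".split()),
    "ad2": Counter("guitar lessons for beginners".split()),
}
best = max(ads, key=lambda a: cosine(page, ads[a]))
print(best)  # -> ad1
```

The "impedance mismatch" problem arises exactly here: an ad whose vocabulary differs from the page's yields a near-zero cosine even when it is topically relevant, which motivates the expansion techniques discussed above.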