Course teaching resources for "Artificial Intelligence, Machine Learning and Big Data" (reference literature): Emoticon Smoothed Language Models for Twitter Sentiment Analysis



Emoticon Smoothed Language Models for Twitter Sentiment Analysis

Kun-Lin Liu, Wu-Jun Li, Minyi Guo
Shanghai Key Laboratory of Scalable Computing and Systems
Department of Computer Science and Engineering, Shanghai Jiao Tong University, China
liukunlin@sjtu.edu.cn, {liwujun,guo-my}@cs.sjtu.edu.cn

Copyright © 2012, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

Twitter sentiment analysis (TSA) has become a hot research topic in recent years. The goal of this task is to discover the attitude or opinion expressed in tweets, which is typically formulated as a machine learning based text classification problem. Some methods use manually labeled data to train fully supervised models, while others use noisy labels, such as emoticons and hashtags, for model training. In general, only a limited amount of training data is available for fully supervised models because manually labeling tweets is labor-intensive and time-consuming. Models trained with noisy labels can draw on a large amount of training data, but the noise in the labels makes it hard for them to achieve satisfactory performance. Hence, the best strategy is to utilize both manually labeled data and noisy labeled data for training. However, how to seamlessly integrate these two different kinds of data into the same learning framework is still a challenge. In this paper, we present a novel model, called emoticon smoothed language model (ESLAM), to handle this challenge. The basic idea is to train a language model based on the manually labeled data, and then use the noisy emoticon data for smoothing. Experiments on real data sets demonstrate that ESLAM can effectively integrate both kinds of data to outperform methods that use only one of them.

Introduction

Sentiment analysis (SA) (Pang and Lee 2007), also known as opinion mining, is mainly about discovering “what others think” from data such as product reviews and news articles. On one hand, consumers can seek advice about a product to make informed purchasing decisions. On the other hand, vendors are paying more and more attention to online opinions about their products and services. Hence, SA has attracted increasing attention from many research communities such as machine learning, data mining, and natural language processing. The sentiment of a document or sentence can be positive, negative or neutral. Hence, SA is actually a three-way classification problem. In practice, most methods adopt a two-step strategy for SA (Pang and Lee 2007). In the subjectivity classification step, the target is classified as subjective or neutral (objective), and in the polarity classification step, the subjective targets are further classified as positive or negative. Hence, two classifiers are trained for the whole SA process: one is called the subjectivity classifier, and the other is called the polarity classifier. Since (Pang, Lee, and Vaithyanathan 2002) formulated SA as a machine learning based text classification problem, more and more machine learning methods have been proposed for SA (Pang and Lee 2007).

Twitter is a popular online micro-blogging service launched in 2006. Users on Twitter write tweets of up to 140 characters to tell others what they are doing and thinking. According to some sources¹, by 2011 there were over 300 million users on Twitter, and 300 million new tweets are generated every day. Because almost all tweets are public, these rich data offer new opportunities for research on data mining and natural language processing (Liu et al. 2011a; 2011b; 2011c; Jiang et al. 2011).

One way to perform Twitter sentiment analysis (TSA) is to directly apply traditional SA methods (Pang and Lee 2007). However, tweets are quite different from other text forms like product reviews and news articles. Firstly, tweets are often short and ambiguous because of the character limit. Secondly, there are more misspelled words, slang, modal particles and acronyms on Twitter because of its casual form. Thirdly, a huge amount of unlabeled or noisy labeled data can be easily downloaded through the Twitter API. Therefore, many novel SA methods have been specially developed for TSA. These methods can be mainly divided into two categories: fully supervised methods and distantly supervised methods².

¹ http://en.wikipedia.org/wiki/Twitter
² We use the term ‘distant’ in the same sense as (Go, Bhayani, and Huang 2009).

The fully supervised methods try to learn the classifiers from manually labeled data. (Jansen et al. 2009) uses the multinomial Bayes model to perform automatic TSA. (Bermingham and Smeaton 2010) compares support vector machine (SVM) and multinomial naive Bayes (MNB) for both blog and microblog SA, and finds that SVM outperforms MNB on blogs with long text but MNB outperforms SVM on microblogs with short text. One problem with the fully supervised methods is that it is very labor-intensive and time-consuming to manually label the data, and hence the training data sets for most methods are often too small to


guarantee a good performance.

More recent works have focused on distantly supervised methods, which learn the classifiers from data with noisy labels such as emoticons and hashtags. The distant supervision method (Go, Bhayani, and Huang 2009) uses emoticons like “:)” and “:(” as noisy labels for polarity classification. The basic assumption is that a tweet containing “:)” most likely expresses a positive emotion and that a tweet containing “:(” is negative. Experiments show that these emoticons do contain some discriminative information for SA. Hashtags (e.g., #sucks) or smileys are used in (Davidov, Tsur, and Rappoport 2010) to identify sentiment types. (Barbosa and Feng 2010) uses the noisy data collected from some Twitter sentiment detection web sites, such as Twitter Sentiment³. (Kouloumpis, Wilson, and Moore 2011) investigates both hashtags and emoticons and finds that combining both of them gives better performance than using only hashtags. The advantage of these distantly supervised methods is that the labor-intensive manual annotation can be avoided and a large amount of training data can be easily built, either from the Twitter API or from existing web sites. However, due to the noise in the labels, the accuracy of these methods is not satisfactory.

³ http://twittersentiment.appspot.com/

Considering the shortcomings of the fully supervised and distantly supervised methods, we argue that the best strategy is to utilize both manually labeled data and noisy labeled data for training. However, how to seamlessly integrate these two different kinds of data into the same learning framework is still a challenge. In this paper, we propose a novel model, called emoticon smoothed language model (ESLAM), to handle this challenge. The main contributions of ESLAM are outlined as follows:

• ESLAM uses the noisy emoticon data to smooth the language model trained from manually labeled data. Hence, ESLAM seamlessly integrates both manually labeled data and noisy labeled data into a probabilistic framework. The large amount of noisy emoticon data gives ESLAM the power to deal with misspelled words, slang, modal particles, acronyms, and unforeseen test words, which cannot be easily handled by fully supervised methods.
• Besides polarity classification, ESLAM can also be used for subjectivity classification, which cannot be handled by most existing distantly supervised methods.
• Rather than crawling a large amount of noisy data to local disks, which is the typical choice of existing distantly supervised methods, we propose an efficient and convenient way to directly estimate the word probabilities from the Twitter API without downloading any tweet. This is very promising because it is very expensive in terms of time and storage to download and process a large amount of tweets.
• Experiments on real data sets demonstrate that ESLAM can effectively integrate both manually labeled data and noisy labeled data to outperform methods that use only one of them.

Related Work

SA (Pang and Lee 2007) has a long history in natural language processing. Before (Pang, Lee, and Vaithyanathan 2002), almost all methods were partially knowledge-based. (Pang, Lee, and Vaithyanathan 2002) shows that machine learning techniques, such as naive Bayes, maximum entropy classifiers, and SVM, can outperform the knowledge-based baselines on movie reviews. After that, machine learning based methods became the mainstream for SA.

Earlier works on TSA follow the methods of traditional SA on normal text forms like movie reviews. These methods are mainly fully supervised (Jansen et al. 2009; Bermingham and Smeaton 2010) and have been introduced in the Introduction section. The most recent works include target-dependent SA based on SVM (Jiang et al. 2011), user-level SA based on social networks (Tan et al. 2011), sentiment stream analysis based on association rules (Silva et al. 2011), and real-time SA (Guerra et al. 2011).

Recently, more and more distantly supervised methods have been proposed. The training data of (Go, Bhayani, and Huang 2009) consist of tweets with emoticons like “:)” and “:(”, and these emoticons are used as noisy labels. (Davidov, Tsur, and Rappoport 2010) uses 50 Twitter tags and 15 smileys as noisy labels to identify and classify diverse sentiment types of tweets. Other methods with noisy labels (Barbosa and Feng 2010; Kouloumpis, Wilson, and Moore 2011) have also been proposed. None of these methods handles subjectivity classification well. Furthermore, these methods need to crawl all the data and store them on local disks. This is very inefficient when millions or even billions of tweets are used, because the request rate for crawling tweets is limited by the Twitter server.

Although a lot of TSA methods have been proposed, few of them can effectively integrate both manually labeled data and noisy labeled data into the same framework, which motivates our ESLAM work in this paper.

Our Approach

In this section, we first present how to adapt language models (Manning, Raghavan, and Schutze 2009) for SA. Then we propose a very effective and efficient way to learn the emoticon model from the Twitter API. Finally, we introduce the strategy to seamlessly integrate both manually labeled data and emoticon data into a probabilistic framework, which is our ESLAM method.

Language Models for SA

Language models (LM) can be either probabilistic or non-probabilistic. In this paper, we refer to probabilistic language models, which are widely used in information retrieval and natural language processing (Ponte and Croft 1998; Zhai and Lafferty 2004; Manning, Raghavan, and Schutze 2009). An LM assigns a probability to a sequence of words. In information retrieval, we first estimate an LM for each document; then we compute a likelihood measuring how likely a query is to be generated by each document LM, and rank the documents with respect to the likelihoods.


TSA is actually a classification problem. To adapt LM for TSA, we concatenate all the tweets from the same class to form one synthetic document. Hence, for the polarity classification problem, one document is constructed from positive training tweets, and the other document is constructed from negative training tweets. Then we learn two LMs, one for the positive class and the other for the negative class. The LM learning procedure for subjectivity classification is similar. During the test phase, we treat each test tweet as a query, and then we can use the likelihoods to rank the classes. The class with the highest likelihood is chosen as the label of the test tweet.

We use $c_1$ and $c_2$ to denote the two language models. In polarity classification, $c_1$ is the language model for positive tweets and $c_2$ is for negative tweets. In subjectivity classification, $c_1$ is for the subjective class and $c_2$ is for the objective (neutral) class. In order to classify a tweet $t$ as $c_1$ or $c_2$, we need to estimate the tweet likelihoods $P(t|c_1)$ and $P(t|c_2)$. By using the common unigram assumption, we get:

$$P(t|c) = \prod_{i=1}^{n} P(w_i|c),$$

where $n$ is the number of words in tweet $t$ and $P(w_i|c)$ is a multinomial distribution estimated from the LM of class $c$. This probability simulates the generative process of the test tweet. First, the first word ($w_1$) is generated by following the multinomial distribution $P(w_i|c)$. After that, the second word is generated independently of the previous word by following the same distribution. This process continues until all the words in this tweet have been generated.

One commonly used method to estimate the distributions is the maximum likelihood estimate (MLE), which computes the probability as follows:

$$P_a(w_i|c) = \frac{N_{i,c}}{N_c},$$

where $N_{i,c}$ is the number of times word $w_i$ appears in the training data of class $c$ and $N_c$ is the total number of words in the training data of class $c$.

In general, the vocabulary is determined by the training set.
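The unigram likelihood and the MLE estimate can be illustrated with a small Python sketch on toy data. The function names (`train_mle`, `log_likelihood`, `classify`), the pre-tokenized toy tweets, and the use of log-probabilities (which leave the ranking of classes unchanged) are assumptions made for illustration and are not part of the paper.

```python
import math
from collections import Counter

def train_mle(tweets):
    """Estimate P_a(w|c) = N_{w,c} / N_c from the tokenized tweets of one class."""
    counts = Counter(word for tweet in tweets for word in tweet)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

def log_likelihood(tweet, model):
    """log P(t|c) = sum_i log P(w_i|c); -inf if any word has zero probability."""
    score = 0.0
    for word in tweet:
        p = model.get(word, 0.0)
        if p == 0.0:
            return float("-inf")
        score += math.log(p)
    return score

def classify(tweet, models):
    """Return the class label whose LM gives the highest likelihood."""
    return max(models, key=lambda label: log_likelihood(tweet, models[label]))

# Toy training data (hypothetical), one list of tokens per tweet.
positive = [["good", "day"], ["love", "good", "music"]]
negative = [["bad", "day"], ["hate", "bad", "traffic"]]
models = {"positive": train_mle(positive), "negative": train_mle(negative)}

print(models["positive"]["good"])         # "good" is 2 of the 5 positive tokens
print(classify(["good", "day"], models))  # label with the highest likelihood
print(log_likelihood(["great", "day"], models["positive"]))
```

The last call returns negative infinity because "great" never occurs in the positive training tweets, so its MLE probability is zero and the whole tweet likelihood collapses.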
To classify tweets in the test set, it is very common to encounter words that do not appear in the training set, especially when there are not enough training data or the words are not well-formed. In these cases, smoothing (Zhai and Lafferty 2004) plays a very important role in language models because it avoids assigning zero probability to unseen words. Furthermore, smoothing can make the model more accurate and robust. Representative smoothing methods include Dirichlet smoothing and Jelinek-Mercer (JM) smoothing (Zhai and Lafferty 2004). Although the original JM smoothing method is used to linearly interpolate the MLE model with the collection model (Zhai and Lafferty 2004), in this paper we use the JM smoothing method to linearly interpolate the MLE model with the emoticon model.

Emoticon Model

From the emoticon data, we can also build the LMs for different classes. We propose a very effective and efficient way to estimate the emoticon LM $P_u(w_i|c)$ from the Twitter Search API. The Twitter Search API⁴ is a dedicated API for running searches against the real-time index of recent tweets. Its index covers the most recent 6 to 9 days of tweets. Given a query which consists of one or several words, the API returns up to 1500 relevant tweets and their posting times.

⁴ https://dev.twitter.com/docs/using-search

Polarity Classification

To get $P_u(w_i|c_1)$, the probability of $w_i$ in the positive class, we assume that all tweets containing “:)” are positive. We build a query “$w_i$ :)” and input it to the Search API. It then returns tweets containing both $w_i$ and “:)” together with their posting times. After summarizing the results, we get the number of tweets $n_{w_i}$ and the time range of these tweets $t_{w_i}$. Then we build another query “:)” and get the number of returned tweets $n_s$ and the time range $t_s$. Some estimates⁵ show that a tweet contains 15 words on average.

⁵ http://blog.oup.com/2009/06/oxford-twitter/

Assume that the tweets on Twitter are uniformly distributed with respect to time. Similar to the rule for getting $P_a(w_i|c)$, we can estimate $P_u(w_i|c_1)$ with the following rule:

$$P_u(w_i|c_1) = \frac{n_{w_i}/t_{w_i}}{(n_s/t_s) \times 15} = \frac{n_{w_i} \times t_s}{15 \times t_{w_i} \times n_s}.$$

The term $n_{w_i}/t_{w_i}$ is roughly the number of times word $w_i$ appears in class $c$ per unit time, and the term $(n_s/t_s) \times 15$ is roughly the total number of words in class $c$ per unit time.

Let $F_u = \sum_{j=1}^{|V|} P_u(w_j|c)$ be the normalization factor, where $|V|$ is the size of the vocabulary containing both seen and unseen words. Then each estimated $P_u(w_i|c)$ should be normalized so that they sum to one:

$$P_u(w_i|c) := \frac{P_u(w_i|c)}{F_u} = \frac{P_u(w_i|c)}{\sum_{j=1}^{|V|} P_u(w_j|c)} = \frac{\dfrac{n_{w_i} \times t_s}{15 \times t_{w_i} \times n_s}}{\sum_{j=1}^{|V|} \dfrac{n_{w_j} \times t_s}{15 \times t_{w_j} \times n_s}} = \frac{n_{w_i}/t_{w_i}}{\sum_{j=1}^{|V|} n_{w_j}/t_{w_j}}.$$

The factors $t_s$, $n_s$ and 15 cancel, so there is no need to get $t_s$ and $n_s$: $P_u(w_i|c)$ is determined only by $n_{w_i}$ and $t_{w_i}$.

For the LM of the negative class, we assume that the negative tweets are those containing “:(”. The estimation procedure for $P_u(w_i|c_2)$ is similar to that for $P_u(w_i|c_1)$. The only difference is that the query should be changed to “$w_i$ :(”.

Subjectivity Classification

For subjectivity classification, the two classes are subjective and objective. The assumption for subjective tweets is that tweets with “:)” or “:(” carry the subjectivity of the users. So we build the query “$w_i$ :) OR :(” for the subjective class.

As for the objective LM, getting $P_u(w_i|c_2)$, the probability of $w_i$ in the objective class, is much more challenging than getting the probability in the subjective class. To the best of our knowledge, no general assumption for objective tweets has been reported by researchers. We tried the strategy which treats tweets without emoticons as objective, but the experiments showed that

the results were not satisfactory.which implies that this as- Digits.All Digits in tweets are replaced with "twitter- sumption is unreasonable.(Kouloumpis,Wilson,and Moore digit” 2011)tries to use some hashtags like"#jobs"as indicators Links.All urls in tweets are replaced with"twitterurl" for objective tweets.However,this assumption is not gen- eral enough because the number of tweets containing spe- ●Stopwords.Stopwords like"the”and“to”are removed. cific hashtags is limited and these tweets'sentiment may be Lower case and Stemming.All words are changed to their biased to certain topics like“jobs”. lower cases and stemmed to terms. Here we present a novel assumption for objective tweets that tweets containing an objective url link is assumed to be Retweets and Duplicates.Retweets and duplicate tweets objective.Based on our observation,we find that urls linking are removed to avoid giving extra weight to these tweets to the picture sites (e.g.,twitpic.com)or video sites (e.g., in training data. youtube.com)are often subjective and other urls like those linking to news articles are usually objective.Hence,if a Evaluation Scheme and Metrics url link doesn't represent pictures or videos,we call it an After removing the retweets or duplicates and setting the objective url link.Based on the above assumption,we build classes to be balanced,we randomly choose 956 tweets for the query"wfilter:links"6 to get the statistics about the polarity classification,including 478 positive tweets and 478 objective class. negative ones.For the subjectivity classification,we also set the classes to be balanced and randomly choose 1948 tweets ESLAM for evaluation,including 974 subjective tweets and 974 ob- After we have estimated the Pa(walc)from manually la- jective (neutral)ones. 
beled data and P(wilc)from the noisy emoticon data The evaluation schemes for both polarity and subjectivity we can integrate them into the same probabilistic frame- classification are similar.Assume the total number of man- work Pco(wilc).Before combining Pa(wilc)and Pu(wilc). ually labeled tweets,including both training and test data, there's another important step:smoothing P(wilc).Be- is X.Each time we randomly sample the same amount of cause P(wic)is estimated from noisy emoticon data,it can tweets (say Y)for both classes (e.g.,positive and negative) be biased.We adopt Dirichlet smoothing(Zhai and Lafferty for training,and use the rest X-2Y tweets for test.This 2004)to smooth Pu(wilc). random selection and testing is carried out 10 rounds inde- By following the JM smoothing principle(Zhai and Laf- pendently for each unique training set size,and the average ferty 2004),our ESLAM model Peo(wilc)can be computed performance is reported.We perform experiments with dif- as follows: ferent sizes of training set,i.e.,Y is set to different values, such as 32,64.and 128. Pco(lc）=aPa(ilc）+(1-a)Pu(ilc, (1) As in(Go,Bhayani,and Huang 2009)and (Kouloumpis, where a E[0,1]is the combination parameter controlling Wilson,and Moore 2011),we adopt accuracy and F-score the contribution of each component. as our evaluation metrics.Accuracy is a measure of what percentage of test data are correctly predicted,and F-score Experiments is computed by combining precision and recall. 
Data Set Effect of Emoticons The publicly available Sanders Corpus?is used for evalu- ation.It consists of 5513 manually labeled tweets.These We compare our ESLAM method to the fully supervised tweets were collected with respect to one of the four dif- language model(LM)to verify whether the smoothing with ferent topics(Apple,Google,Microsoft,and Twitter).After emoticons is useful or not.Please note that the fully super- removing the non-English and spam tweets,we have 3727 vised LM uses only the manually labeled data for training tweets left.The detailed information of the corpus is shown while ESLAM integrates both manually labeled data and the in Table 1.As for the noisy emoticon data,theoretically we emoticon data for training.Figure 1 and Figure 2 respec- use all the data existing in Twitter by sampling with its API. tively illustrate the accuracy and F-score of the two methods with different number of manually labeled training data,i.e., 2Y=32.64,128,256,512,768. Table 1:Corpus Statistics From Figure 1 and Figure 2,we can see that as the num- Corpus Positive Negative Neutral #Total ber of manually labeled data increases,the performance of both methods will also increase,which is reasonable because Sanders 570 654 2503 3727 the manually labeled data contain strong discriminative in- formation.Under all the evaluation settings,ESLAM con- We adopt the following strategies to preprocess the data: sistently outperforms the fully supervised LM,in particular Username.Twitter usernames which start with are re- for the settings with small number of manually labeled data. placed with"twitterusername" This implies that the noisy emoticon data do have some use- ful information and our ESLAM can effectively exploit it to 6filter:links means returning tweets containing urls. achieve good performance 7http://www.sananalytics.com/lab/ Figure 3 and Figure 4 demonstrate the accuracy and F- twitter-sentiment/ score of the two methods on subjectivity classification with

the results were not satisfactory, which implies that this assumption is unreasonable. (Kouloumpis, Wilson, and Moore 2011) tries to use some hashtags like “#jobs” as indicators for objective tweets. However, this assumption is not general enough because the number of tweets containing specific hashtags is limited and these tweets’ sentiment may be biased to certain topics like “jobs”. Here we present a novel assumption for objective tweets that tweets containing an objective url link is assumed to be objective. Based on our observation, we find that urls linking to the picture sites (e.g., twitpic.com) or video sites (e.g., youtube.com) are often subjective and other urls like those linking to news articles are usually objective. Hence, if a url link doesn’t represent pictures or videos, we call it an objective url link. Based on the above assumption, we build the query “wif ilter : links” 6 to get the statistics about the objective class. ESLAM After we have estimated the Pa(wi |c) from manually labeled data and Pu(wi |c) from the noisy emoticon data, we can integrate them into the same probabilistic framework Pco(wi |c). Before combining Pa(wi |c) and Pu(wi |c), there’s another important step: smoothing Pu(wi |c). Because Pu(wi |c) is estimated from noisy emoticon data, it can be biased. We adopt Dirichlet smoothing (Zhai and Lafferty 2004) to smooth Pu(wi |c). By following the JM smoothing principle (Zhai and Lafferty 2004), our ESLAM model Pco(wi |c) can be computed as follows: Pco(wi |c) = αPa(wi |c) + (1 − α)Pu(wi |c), (1) where α ∈ [0, 1] is the combination parameter controlling the contribution of each component. Experiments Data Set The publicly available Sanders Corpus7 is used for evaluation. It consists of 5513 manually labeled tweets. These tweets were collected with respect to one of the four different topics (Apple, Google, Microsoft, and Twitter). After removing the non-English and spam tweets, we have 3727 tweets left. 
The detailed information of the corpus is shown in Table 1. As for the noisy emoticon data, in principle we use all the data available on Twitter, sampled through its API.

Table 1: Corpus Statistics

Corpus     # Positive   # Negative   # Neutral   # Total
Sanders    570          654          2503        3727

We adopt the following strategies to preprocess the data:

• Username. Twitter usernames which start with @ are replaced with “twitterusername”.
• Digits. All digits in tweets are replaced with “twitterdigit”.
• Links. All urls in tweets are replaced with “twitterurl”.
• Stopwords. Stopwords like “the” and “to” are removed.
• Lower case and Stemming. All words are changed to lower case and stemmed to terms.
• Retweets and Duplicates. Retweets and duplicate tweets are removed to avoid giving extra weight to these tweets in the training data.

⁶ filter:links means returning tweets containing urls.
⁷ http://www.sananalytics.com/lab/twitter-sentiment/

Evaluation Scheme and Metrics

After removing the retweets and duplicates and balancing the classes, we randomly choose 956 tweets for polarity classification, including 478 positive tweets and 478 negative ones. For subjectivity classification, we also balance the classes and randomly choose 1948 tweets for evaluation, including 974 subjective tweets and 974 objective (neutral) ones.

The evaluation schemes for polarity and subjectivity classification are similar. Assume the total number of manually labeled tweets, including both training and test data, is X. Each time, we randomly sample the same number of tweets (say Y) from each class (e.g., positive and negative) for training, and use the remaining X − 2Y tweets for testing. This random selection and testing is carried out for 10 independent rounds for each training set size, and the average performance is reported. We perform experiments with different training set sizes, i.e., Y is set to different values, such as 32, 64, and 128.
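The sampling protocol above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the `train_and_predict` callback (which stands in for any classifier), the fixed seed, and the toy data are all assumptions introduced here.

```python
import random

def balanced_split(class_a, class_b, y, rng):
    """Sample y training items per class; the remaining X - 2y items form the test set."""
    idx_a = set(rng.sample(range(len(class_a)), y))
    idx_b = set(rng.sample(range(len(class_b)), y))
    train = [(x, 0) for i, x in enumerate(class_a) if i in idx_a]
    train += [(x, 1) for i, x in enumerate(class_b) if i in idx_b]
    test = [(x, 0) for i, x in enumerate(class_a) if i not in idx_a]
    test += [(x, 1) for i, x in enumerate(class_b) if i not in idx_b]
    return train, test

def average_accuracy(class_a, class_b, y, train_and_predict, rounds=10, seed=0):
    """Average test accuracy over independent random rounds.

    train_and_predict is a hypothetical callback: it receives the training
    pairs (tweet, label) and returns a function mapping a tweet to a label.
    """
    rng = random.Random(seed)
    accuracies = []
    for _ in range(rounds):
        train, test = balanced_split(class_a, class_b, y, rng)
        predict = train_and_predict(train)
        correct = sum(1 for x, label in test if predict(x) == label)
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies)

# Toy data matching the polarity setting: 478 tweets per class (X = 956).
positives = [f"pos_{i}" for i in range(478)]
negatives = [f"neg_{i}" for i in range(478)]

def toy_trainer(train):
    # Placeholder "model" that simply reads the label off the toy string.
    return lambda tweet: 0 if tweet.startswith("pos") else 1

mean_acc = average_accuracy(positives, negatives, 32, toy_trainer)
```

Sampling indices rather than items keeps the split well defined even if two tweets happen to have identical text.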
As in (Go, Bhayani, and Huang 2009) and (Kouloumpis, Wilson, and Moore 2011), we adopt accuracy and F-score as our evaluation metrics. Accuracy is the percentage of test data that are correctly predicted, and F-score is computed by combining precision and recall.

Effect of Emoticons

We compare our ESLAM method to the fully supervised language model (LM) to verify whether smoothing with emoticons is useful. Note that the fully supervised LM uses only the manually labeled data for training, while ESLAM integrates both the manually labeled data and the emoticon data. Figure 1 and Figure 2 illustrate the accuracy and F-score, respectively, of the two methods with different numbers of manually labeled training data, i.e., 2Y = 32, 64, 128, 256, 512, 768.

Figure 1: Effect of emoticons on accuracy of polarity classification. (Plot: accuracy in % versus number of manually labeled training tweets, for the fully supervised LM and ESLAM.)

Figure 2: Effect of emoticons on F-score of polarity classification. (Plot: F-score in % versus number of manually labeled training tweets, for the fully supervised LM and ESLAM.)

From Figure 1 and Figure 2, we can see that as the number of manually labeled data increases, the performance of both methods also increases, which is reasonable because the manually labeled data contain strong discriminative information. Under all the evaluation settings, ESLAM consistently outperforms the fully supervised LM, in particular for the settings with a small number of manually labeled data. This implies that the noisy emoticon data do contain useful information and that ESLAM can effectively exploit it to achieve good performance.

Figure 3 and Figure 4 show the accuracy and F-score, respectively, of the two methods on subjectivity classification with different numbers of manually labeled training data. The results are similar to those for polarity classification, which once again verifies the effectiveness of ESLAM in utilizing the noisy emoticon data. The good performance of ESLAM also verifies that our url link based method is effective for finding objective tweets, which is a big challenge for most existing distantly supervised methods.

Figure 3: Effect of emoticons on accuracy of subjectivity classification. (Plot: accuracy in % versus number of manually labeled training tweets, for the fully supervised LM and ESLAM.)

Figure 4: Effect of emoticons on F-score of subjectivity classification. (Plot: F-score in % versus number of manually labeled training tweets, for the fully supervised LM and ESLAM.)

Effect of Manually Labeled Data

We compare our ESLAM method to the distantly supervised LM to verify whether the manually labeled data can provide extra useful information for classification. Note that the distantly supervised LM uses only the noisy emoticon data for training, while ESLAM integrates both the manually labeled data and the emoticon data.

Figure 5 and Figure 6 illustrate the accuracy and F-score, respectively, of the two methods on polarity classification with different numbers of manually labeled training data. The blue line corresponds to the performance of the distantly supervised LM, which also corresponds to the case of zero manually labeled data. The red line shows the results of ESLAM.
We can see that ESLAM achieves better performance than the distantly supervised LM. As the amount of manually labeled data increases, the performance gap between them becomes larger and larger. This verifies our claim that it is not enough to use only data with noisy labels for training.

Figure 5: Effect of manually labeled data on accuracy of polarity classification. (Plot: accuracy in % versus number of manually labeled training tweets, from 0 to 768, for the distantly supervised LM and ESLAM.)

Figure 6: Effect of manually labeled data on F-score of polarity classification. (Plot: F-score in % versus number of manually labeled training tweets, from 0 to 768, for the distantly supervised LM and ESLAM.)

Figure 7 and Figure 8 illustrate the accuracy and F-score, respectively, of the two methods on subjectivity classification with different numbers of manually labeled training data. The results are similar to those for polarity classification.

Figure 7: Effect of manually labeled data on accuracy of subjectivity classification. (Plot: accuracy in % versus number of manually labeled training tweets, from 0 to 768, for the distantly supervised LM and ESLAM.)

Figure 8: Effect of manually labeled data on F-score of subjectivity classification. (Plot: F-score in % versus number of manually labeled training tweets, from 0 to 768, for the distantly supervised LM and ESLAM.)

Sensitivity to Parameters

The parameter $\alpha$ in (1) plays a critical role in controlling the relative contribution of the manually labeled information and the noisy labeled information. To show the effect of this parameter in detail, we try different values for polarity classification. Figure 9 and Figure 10 show the accuracy of ESLAM with 128 and 512 labeled training tweets, respectively. The case $\alpha = 0$ means that only the noisy emoticon data are used, and $\alpha = 1$ is the fully supervised case. The results in the figures clearly show that the best strategy is to integrate both manually labeled data and noisy data into training. We also notice that with 512 labeled training data ESLAM achieves its best performance at a relatively larger $\alpha$ than with 128 labeled data, which is reasonable because a model estimated from more manually labeled data deserves a larger weight. Furthermore, ESLAM is not sensitive to small variations in the value of $\alpha$, because the range of $\alpha$ over which it achieves good performance is large.

Figure 9: Effect of the smoothing parameter $\alpha$ with 128 labeled training tweets. (Plot: accuracy in % versus $\alpha$ from 0 to 1.)

Figure 10: Effect of the smoothing parameter $\alpha$ with 512 labeled training tweets. (Plot: accuracy in % versus $\alpha$ from 0 to 1.)

Conclusion

Existing methods use either manually labeled data or noisy labeled data for Twitter sentiment analysis, but few of them utilize both for training. In this paper, we propose a novel model, called emoticon smoothed language model (ESLAM), to seamlessly integrate these two kinds of data into the same probabilistic framework. Experiments on real data sets show that our ESLAM method can effectively integrate both kinds of data to outperform methods that use only one of them.
Our ESLAM method is general enough to integrate other kinds of noisy labels for model training, which will be pursued in our future work.

Acknowledgments

This work is supported by the NSFC (No. 61100125) and the 863 Program of China (No. 2011AA01A202, No. 2012AA011003).

References

Barbosa, L., and Feng, J. 2010. Robust sentiment detection on twitter from biased and noisy data. In COLING, 36–44.
Bermingham, A., and Smeaton, A. F. 2010. Classifying sentiment in microblogs: is brevity an advantage? In CIKM, 1833–1836.
Davidov, D.; Tsur, O.; and Rappoport, A. 2010. Enhanced sentiment learning using twitter hashtags and smileys. In COLING, 241–249.
Go, A.; Bhayani, R.; and Huang, L. 2009. Twitter sentiment classification using distant supervision. Technical report.
Guerra, P. H. C.; Veloso, A.; Jr., W. M.; and Almeida, V. 2011. From bias to opinion: a transfer-learning approach to real-time sentiment analysis. In KDD, 150–158.
Jansen, B. J.; Zhang, M.; Sobel, K.; and Chowdury, A. 2009. Twitter power: Tweets as electronic word of mouth. JASIST 60(11):2169–2188.
Jiang, L.; Yu, M.; Zhou, M.; Liu, X.; and Zhao, T. 2011. Target-dependent twitter sentiment classification. In ACL, 151–160.
Kouloumpis, E.; Wilson, T.; and Moore, J. 2011. Twitter sentiment analysis: The good the bad and the omg! In ICWSM, 538–541.
Liu, X.; Li, K.; Zhou, M.; and Xiong, Z. 2011a. Collective semantic role labeling for tweets with clustering. In IJCAI, 1832–1837.
Liu, X.; Li, K.; Zhou, M.; and Xiong, Z. 2011b. Enhancing semantic role labeling for tweets using self-training. In AAAI.
Liu, X.; Zhang, S.; Wei, F.; and Zhou, M. 2011c. Recognizing named entities in tweets. In ACL, 359–367.
