guarantee a good performance.
More recent work has focused on distantly supervised methods, which learn classifiers from data with noisy labels such as emoticons and hashtags. The distant supervision method of (Go, Bhayani, and Huang 2009) uses emoticons like “:)” and “:(” as noisy labels for polarity classification. The basic assumption is that a tweet containing “:)” most likely expresses a positive emotion and that a tweet containing “:(” is negative. Experiments show that these emoticons do contain some discriminative information for SA. Hashtags (e.g., #sucks) and smileys are used in (Davidov, Tsur, and Rappoport 2010) to identify sentiment types. (Barbosa and Feng 2010) uses noisy data collected from Twitter sentiment detection web sites such as Twitter Sentiment³. (Kouloumpis, Wilson, and Moore 2011) investigates both hashtags and emoticons and finds that combining them gives better performance than using hashtags alone. The advantage of these distantly supervised methods is that labor-intensive manual annotation can be avoided and a large amount of training data can be built easily, either from the Twitter API or from existing web sites. However, due to the noise in the labels, the accuracy of these methods is not satisfactory.

Considering the shortcomings of both the fully supervised and the distantly supervised methods, we argue that the best strategy is to use both manually labeled data and noisy labeled data for training. However, how to seamlessly integrate these two kinds of data into the same learning framework is still a challenge. In this paper, we propose a novel model, called emoticon smoothed language model (ESLAM), to handle this challenge. The main contributions of ESLAM are outlined as follows:

• ESLAM uses the noisy emoticon data to smooth the language model trained from manually labeled data. Hence, ESLAM seamlessly integrates both manually labeled data and noisy labeled data into a probabilistic framework.
The large amount of noisy emoticon data gives ESLAM the power to deal with misspelled words, slang, modal particles, acronyms, and unforeseen test words, which cannot be easily handled by fully supervised methods.
• Besides polarity classification, ESLAM can also be used for subjectivity classification, which cannot be handled by most existing distantly supervised methods.
• Rather than crawling a large amount of noisy data to local disks, which is the typical choice of existing distantly supervised methods, we propose an efficient and convenient way to estimate the word probabilities directly from the Twitter API without downloading any tweets. This is very promising because downloading and processing a large number of tweets is very expensive in terms of time and storage.
• Experiments on real data sets demonstrate that ESLAM can effectively integrate both manually labeled data and noisy labeled data to outperform methods that use only one of them.

³ http://twittersentiment.appspot.com/

Related Work
SA (Pang and Lee 2007) has a long history in natural language processing. Before (Pang, Lee, and Vaithyanathan 2002), almost all methods were partially knowledge-based. (Pang, Lee, and Vaithyanathan 2002) shows that machine learning techniques, such as naive Bayes, maximum entropy classifiers, and SVM, can outperform the knowledge-based baselines on movie reviews. Since then, machine learning based methods have become the mainstream for SA.

Earlier works on TSA follow the methods of traditional SA on normal text forms like movie reviews. These methods are mainly fully supervised (Jansen et al. 2009; Bermingham and Smeaton 2010) and have been introduced in the Introduction section. More recent works include target-dependent SA based on SVM (Jiang et al. 2011), user-level SA based on social networks (Tan et al. 2011), sentiment stream analysis based on association rules (Silva et al. 2011), and real-time SA (Guerra et al. 2011).
Recently, more and more distantly supervised methods have been proposed. The training data of (Go, Bhayani, and Huang 2009) consist of tweets with emoticons like “:)” and “:(”, and these emoticons are used as noisy labels. (Davidov, Tsur, and Rappoport 2010) uses 50 Twitter tags and 15 smileys as noisy labels to identify and classify diverse sentiment types of tweets. Other methods with noisy labels (Barbosa and Feng 2010; Kouloumpis, Wilson, and Moore 2011) have also been proposed. None of these methods handles subjectivity classification well. Furthermore, these methods need to crawl all the data and store them on local disks. This is very inefficient when millions or even billions of tweets are used, because the request rate for crawling tweets is limited by the Twitter server.

Although a lot of TSA methods have been proposed, few of them can effectively integrate both manually labeled data and noisy labeled data into the same framework, which motivates our ESLAM work in this paper.

Our Approach
In this section, we first present how to adapt language models (Manning, Raghavan, and Schutze 2009) for SA. Then we propose a very effective and efficient way to learn the emoticon model from the Twitter API. Finally, we introduce the strategy to seamlessly integrate both manually labeled data and emoticon data into a probabilistic framework, which is our ESLAM method.

Language Models for SA
Language models (LM) can be either probabilistic or non-probabilistic. In this paper, we refer to probabilistic language models, which are widely used in information retrieval and natural language processing (Ponte and Croft 1998; Zhai and Lafferty 2004; Manning, Raghavan, and Schutze 2009). An LM assigns a probability to a sequence of words. In information retrieval, we first estimate an LM for each document, then compute a likelihood measuring how likely a query is to be generated by each document LM, and rank the documents with respect to the likelihoods.
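The query-likelihood ranking just described can be illustrated with a minimal sketch. It assumes a unigram LM with add-alpha smoothing over a shared vocabulary; this smoothing is chosen only for illustration and is not the emoticon-based smoothing of ESLAM. The function names and the toy documents are hypothetical.

```python
import math
from collections import Counter

def train_unigram_lm(tokens, vocab, alpha=1.0):
    """Estimate P(w | document) with add-alpha smoothing (an illustrative assumption)."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def query_log_likelihood(query_tokens, lm):
    """Log-probability of the query under a unigram LM (words treated as independent).

    Query words outside the LM's vocabulary are skipped in this sketch.
    """
    return sum(math.log(lm[w]) for w in query_tokens if w in lm)

def rank_documents(docs, query):
    """Return document names sorted by decreasing query likelihood."""
    vocab = {w for text in docs.values() for w in text.split()}
    lms = {name: train_unigram_lm(text.split(), vocab) for name, text in docs.items()}
    query_tokens = query.split()
    scores = {name: query_log_likelihood(query_tokens, lm) for name, lm in lms.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy collection (hypothetical data).
docs = {
    "d1": "great movie great acting",
    "d2": "terrible plot boring movie",
}
ranking = rank_documents(docs, "great movie")
print(ranking)
```

Here the document whose LM assigns the higher probability to the query words is ranked first. The same machinery carries over to SA if one LM is estimated per sentiment class instead of per document.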