Emoticon Smoothed Language Models for Twitter Sentiment Analysis

Kun-Lin Liu, Wu-Jun Li, Minyi Guo
Shanghai Key Laboratory of Scalable Computing and Systems
Department of Computer Science and Engineering, Shanghai Jiao Tong University, China
liukunlin@sjtu.edu.cn, {liwujun,guo-my}@cs.sjtu.edu.cn

Abstract

Twitter sentiment analysis (TSA) has become a hot research topic in recent years. The goal of this task is to discover the attitude or opinion expressed in tweets, which is typically formulated as a machine-learning-based text classification problem. Some methods use manually labeled data to train fully supervised models, while others use noisy labels, such as emoticons and hashtags, for model training.
In general, only a limited amount of training data is available for the fully supervised models because manually labeling tweets is very labor-intensive and time-consuming. Models trained with noisy labels, by contrast, can easily obtain a large amount of training data, but the noise in the labels makes it hard for them to achieve satisfactory performance. Hence, the best strategy is to utilize both manually labeled data and noisy labeled data for training. However, how to seamlessly integrate these two different kinds of data into the same learning framework is still a challenge. In this paper, we present a novel model, called emoticon smoothed language model (ESLAM), to handle this challenge. The basic idea is to train a language model based on the manually labeled data, and then use the noisy emoticon data for smoothing. Experiments on real data sets demonstrate that ESLAM can effectively integrate both kinds of data to outperform those methods using only one of them.

Introduction

Sentiment analysis (SA) (Pang and Lee 2007), also known as opinion mining, is mainly about discovering “what others think” from data such as product reviews and news articles. On one hand, consumers can seek advice about a product to make informed purchasing decisions. On the other hand, vendors are paying more and more attention to online opinions about their products and services. Hence, SA has attracted increasing attention from many research communities such as machine learning, data mining, and natural language processing. The sentiment of a document or sentence can be positive, negative or neutral. Hence, SA is actually a three-way classification problem. In practice, most methods adopt a two-step strategy for SA (Pang and Lee 2007).
In the subjectivity classification step, the target is classified as subjective or neutral (objective), and in the polarity classification step, the subjective targets are further classified as positive or negative. Hence, two classifiers are trained for the whole SA process: a subjectivity classifier and a polarity classifier. Since (Pang, Lee, and Vaithyanathan 2002) formulated SA as a machine-learning-based text classification problem, more and more machine learning methods have been proposed for SA (Pang and Lee 2007).

Copyright © 2012, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Twitter is a popular online micro-blogging service launched in 2006. Users on Twitter write tweets of up to 140 characters to tell others what they are doing and thinking. According to some sources¹, as of 2011 there were over 300 million users on Twitter, and 300 million new tweets were generated every day. Because almost all tweets are public, these rich data offer new opportunities for research on data mining and natural language processing (Liu et al. 2011a; 2011b; 2011c; Jiang et al. 2011).

One way to perform Twitter sentiment analysis (TSA) is to directly apply traditional SA methods (Pang and Lee 2007). However, tweets are quite different from other text forms like product reviews and news articles. Firstly, tweets are often short and ambiguous because of the character limit. Secondly, there are more misspelled words, slang, modal particles and acronyms on Twitter because of its casual form. Thirdly, a huge amount of unlabeled or noisy labeled data can be easily downloaded through the Twitter API. Therefore, many novel SA methods have been specially developed for TSA. These methods can be mainly divided into two categories: fully supervised methods and distantly supervised methods².
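
To make the notion of noisy (distant) labels concrete, the following is a minimal sketch of how emoticons might be used to label tweets automatically. It is an illustration only, not the procedure used in this paper: the emoticon lists and the function names (noisy_label, strip_emoticons, build_noisy_corpus) are assumptions introduced here, and tweets with no emoticon or with conflicting emoticons are simply skipped.

```python
# Hypothetical emoticon lists, chosen for illustration only;
# they are not taken from the paper.
POSITIVE_EMOTICONS = (":)", ":-)", ":D", "=)")
NEGATIVE_EMOTICONS = (":(", ":-(", ":'(")


def noisy_label(tweet):
    """Return 'positive', 'negative', or None, using emoticons as the only cue."""
    has_pos = any(e in tweet for e in POSITIVE_EMOTICONS)
    has_neg = any(e in tweet for e in NEGATIVE_EMOTICONS)
    if has_pos == has_neg:
        # Either no emoticon at all or conflicting emoticons: no label.
        return None
    return "positive" if has_pos else "negative"


def strip_emoticons(tweet):
    """Remove the emoticons so that the label itself is not left in the text."""
    for e in POSITIVE_EMOTICONS + NEGATIVE_EMOTICONS:
        tweet = tweet.replace(e, " ")
    return " ".join(tweet.split())


def build_noisy_corpus(tweets):
    """Keep only the tweets that receive a noisy label, as (text, label) pairs."""
    corpus = []
    for tweet in tweets:
        label = noisy_label(tweet)
        if label is not None:
            corpus.append((strip_emoticons(tweet), label))
    return corpus
```

Labels obtained this way cost nothing to collect but are unreliable, since an emoticon does not always reflect the sentiment of the text around it. This is the noise that the distantly supervised methods have to cope with.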

The fully supervised methods try to learn the classifiers from manually labeled data. (Jansen et al. 2009) uses the multinomial Bayes model to perform automatic TSA. (Bermingham and Smeaton 2010) compares support vector machine (SVM) and multinomial naive Bayes (MNB) for both blog and microblog SA, and finds that SVM outperforms MNB on blogs with long text but MNB outperforms SVM on microblogs with short text. One problem with the fully supervised methods is that it is very labor-intensive and time-consuming to manually label the data and hence the training data sets for most methods are often too small to

¹ http://en.wikipedia.org/wiki/Twitter
² We use the terminology ‘distant’ as that from (Go, Bhayani, and Huang 2009).