
Emoticon Smoothed Language Models for Twitter Sentiment Analysis

Kun-Lin Liu, Wu-Jun Li, Minyi Guo
Shanghai Key Laboratory of Scalable Computing and Systems
Department of Computer Science and Engineering, Shanghai Jiao Tong University, China
liukunlin@sjtu.edu.cn, {liwujun,guo-my}@cs.sjtu.edu.cn

Abstract

Twitter sentiment analysis (TSA) has become a hot research topic in recent years. The goal of this task is to discover the attitude or opinion of the tweets, which is typically formulated as a machine learning based text classification problem. Some methods use manually labeled data to train fully supervised models, while others use some noisy labels, such as emoticons and hashtags, for model training.
In general, we can only get a limited amount of training data for the fully supervised models because it is very labor-intensive and time-consuming to manually label the tweets. As for the models with noisy labels, it is hard for them to achieve satisfactory performance due to the noise in the labels, although it is easy to get a large amount of data for training. Hence, the best strategy is to utilize both manually labeled data and noisy labeled data for training. However, how to seamlessly integrate these two different kinds of data into the same learning framework is still a challenge. In this paper, we present a novel model, called the emoticon smoothed language model (ESLAM), to handle this challenge. The basic idea is to train a language model based on the manually labeled data, and then use the noisy emoticon data for smoothing. Experiments on real data sets demonstrate that ESLAM can effectively integrate both kinds of data to outperform those methods using only one of them.

Introduction

Sentiment analysis (SA) (Pang and Lee 2007), also known as opinion mining, is mainly about discovering “what others think” from data such as product reviews and news articles. On one hand, consumers can seek advice about a product to make informed decisions in the consuming process. On the other hand, vendors are paying more and more attention to online opinions about their products and services. Hence, SA has attracted increasing attention from many research communities such as machine learning, data mining, and natural language processing. The sentiment of a document or sentence can be positive, negative or neutral. Hence, SA is actually a three-way classification problem. In practice, most methods adopt a two-step strategy for SA (Pang and Lee 2007).
In the subjectivity classification step, the target is classified as subjective or neutral (objective), and in the polarity classification step, the subjective targets are further classified as positive or negative. Hence, two classifiers are trained for the whole SA process: one is called the subjectivity classifier, and the other is called the polarity classifier. Since Pang, Lee, and Vaithyanathan (2002) formulated SA as a machine learning based text classification problem, more and more machine learning methods have been proposed for SA (Pang and Lee 2007).

Copyright © 2012, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Twitter is a popular online micro-blogging service launched in 2006. Users on Twitter write tweets of up to 140 characters to tell others what they are doing and thinking. According to some sources [1], by 2011 there were over 300 million users on Twitter, and 300 million new tweets were being generated every day. Because almost all tweets are public, these rich data offer new opportunities for doing research on data mining and natural language processing (Liu et al. 2011a; 2011b; 2011c; Jiang et al. 2011).

One way to perform Twitter sentiment analysis (TSA) is to directly exploit traditional SA methods (Pang and Lee 2007). However, tweets are quite different from other text forms like product reviews and news articles. Firstly, tweets are often short and ambiguous because of the character limit. Secondly, there are more misspelled words, slang, modal particles and acronyms on Twitter because of its casual form. Thirdly, a huge amount of unlabeled or noisily labeled data can be easily downloaded through the Twitter API. Therefore, many novel SA methods have been specially developed for TSA. These methods can be mainly divided into two categories: fully supervised methods and distantly supervised methods [2].
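The two-step strategy described earlier in this introduction can be illustrated with a minimal sketch. The function name `two_step_sentiment`, the word lists, and the two toy classifiers are hypothetical stand-ins introduced only for illustration; in practice both steps would be trained classifiers, and nothing here is taken from the paper's implementation.

```python
from typing import Callable

def two_step_sentiment(text: str,
                       is_subjective: Callable[[str], bool],
                       polarity: Callable[[str], str]) -> str:
    """Return 'neutral', 'positive' or 'negative' via a two-step cascade."""
    # Step 1: subjectivity classification (subjective vs. neutral/objective).
    if not is_subjective(text):
        return "neutral"
    # Step 2: polarity classification, applied to subjective text only.
    return polarity(text)

# Toy stand-ins for trained classifiers (hypothetical word lists).
POSITIVE = {"love", "great", "happy"}
NEGATIVE = {"hate", "awful", "sad"}

def toy_is_subjective(text: str) -> bool:
    # A text counts as subjective if it contains any sentiment word.
    words = set(text.lower().split())
    return bool(words & (POSITIVE | NEGATIVE))

def toy_polarity(text: str) -> str:
    # Compare the number of positive and negative words; ties go to positive.
    words = text.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return "positive" if pos >= neg else "negative"

print(two_step_sentiment("i love this phone", toy_is_subjective, toy_polarity))
print(two_step_sentiment("the meeting is at noon", toy_is_subjective, toy_polarity))
```

Passing the two classifiers in as callables keeps the cascade independent of how each step is trained, which mirrors the description of two separately trained classifiers.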
The fully supervised methods try to learn the classifiers from manually labeled data. Jansen et al. (2009) use the multinomial Bayes model to perform automatic TSA. Bermingham and Smeaton (2010) compare support vector machines (SVM) and multinomial naive Bayes (MNB) for both blog and microblog SA, and find that SVM outperforms MNB on blogs with long text but MNB outperforms SVM on microblogs with short text. One problem with the fully supervised methods is that it is very labor-intensive and time-consuming to manually label the data, and hence the training data sets for most methods are often too small to

[1] http://en.wikipedia.org/wiki/Twitter
[2] We use the term ‘distant’ in the sense of Go, Bhayani, and Huang (2009).
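The basic idea stated in the abstract, training a language model on manually labeled data and then smoothing it with noisy emoticon data, can be illustrated with a rough sketch. This excerpt does not give the model's formula, so the sketch assumes a unigram model per class and a linear interpolation with weight `alpha` between the two estimates; the class name `SmoothedUnigramModel`, the add-one estimate for the noisy data, the equal class priors, and the toy tweets are all assumptions made for illustration, not the authors' method.

```python
import math
from collections import Counter

def word_counts(tweets):
    """Count lower-cased, whitespace-separated tokens over a list of tweets."""
    counts = Counter()
    for tweet in tweets:
        counts.update(tweet.lower().split())
    return counts

class SmoothedUnigramModel:
    """Per-class unigram model: labeled-data estimate interpolated with a
    noisy-data estimate (interpolation is an assumption of this sketch)."""

    def __init__(self, labeled_tweets, noisy_tweets, vocab, alpha=0.5):
        self.labeled = word_counts(labeled_tweets)
        self.noisy = word_counts(noisy_tweets)
        self.labeled_total = sum(self.labeled.values())
        self.noisy_total = sum(self.noisy.values())
        self.vocab_size = len(vocab)
        self.alpha = alpha  # keep alpha < 1 so unseen words get nonzero mass

    def prob(self, word):
        # Maximum-likelihood estimate from the small manually labeled set.
        p_labeled = (self.labeled[word] / self.labeled_total
                     if self.labeled_total else 0.0)
        # Add-one estimate from the large noisy set; never zero.
        p_noisy = (self.noisy[word] + 1) / (self.noisy_total + self.vocab_size)
        return self.alpha * p_labeled + (1 - self.alpha) * p_noisy

    def log_likelihood(self, tweet):
        return sum(math.log(self.prob(w)) for w in tweet.lower().split())

def classify(tweet, models):
    """Pick the class whose model gives the tweet the highest likelihood
    (equal class priors assumed)."""
    return max(models, key=lambda label: models[label].log_likelihood(tweet))

# Toy data (hypothetical). Emoticons used as noisy labels are already removed.
labeled = {"positive": ["great day", "love it"],
           "negative": ["bad day", "hate it"]}
noisy = {"positive": ["awesome concert tonight", "love this awesome song"],
         "negative": ["terrible traffic today", "hate this terrible weather"]}

vocab = set()
for group in (labeled, noisy):
    for tweets in group.values():
        vocab.update(word_counts(tweets))

models = {label: SmoothedUnigramModel(labeled[label], noisy[label], vocab)
          for label in labeled}

# "awesome" never occurs in the labeled data; the noisy data supplies evidence.
print(classify("awesome song", models))
print(classify("terrible day", models))
```

In this sketch, `alpha` controls how much the model trusts the manually labeled estimate relative to the noisy one; a word missing from the labeled data still receives probability mass from the emoticon data, which is the effect the smoothing is meant to achieve.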
