sentiments contained in tweets; (c) MoodLens implements an incremental learning scheme to deal with the problems of the sentiment shift of words and the generation of new words; (d) MoodLens is capable of real-time tweet processing and classification, and can therefore serve as a real-time abnormal-event monitoring system. The demo of MoodLens is now available at http://goo.gl/8DQ65.

2. EMOTICON-BASED METHOD

We have noticed that graphical emoticons are popular in Weibo. Recent work [1] has found that graphical emoticons can convey strong sentiment: they help users express their mood when posting a tweet. Hence, we can treat these emoticons as sentiment labels for the tweets. In fact, this is a kind of crowdsourcing: the users themselves label their tweets with emoticons to express their emotions. Consequently, categorizing the emoticons into different sentiments divides the tweets into different emotion classes. Among over 1,000 emoticons, we manually select 95 as sentiment labels (denoted as E) and divide them into four sentiment categories: angry, disgusting, joyful, and sad. As shown in Figure 1, there are 9 emoticons in angry, 14 in disgusting, 50 in joyful, and 22 in sad, respectively.

Figure 1: Sentiment categories and the typical emoticons in each class.

From Dec. 2010 to Feb. 2011, MoodLens collected more than 70 million tweets from Weibo. We extract over 3.5 million tweets that contain emoticons in E as the labeled tweet set, denoted as T. This indicates that nearly 5% of the tweets in Weibo are labeled by sentiment emoticons. Finally, we obtain 569,229 angry tweets, 290,444 disgusting tweets, 2,218,779 joyful tweets, and 607,715 sad tweets. These tweets can be used as an initial sentiment corpus for Weibo.
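To make the labeling step concrete, here is a minimal Python sketch of how a tweet could be mapped to one of the four classes by the emoticons it contains. The emoticon table and the first-match rule are our own illustrative assumptions; the paper's actual 95-emoticon set E, and its handling of tweets carrying emoticons from several categories, are not specified here.

```python
# Hypothetical emoticon-to-category table; the paper's actual 95
# manually selected emoticons (the set E) are not reproduced here.
EMOTICON_CLASSES = {
    "[怒]": "angry",
    "[鄙视]": "disgusting",
    "[哈哈]": "joyful",
    "[泪]": "sad",
    # ... the remaining manually selected emoticons
}

def label_by_emoticon(tweet):
    """Return the sentiment class of the first emoticon from E found
    in the tweet, or None if the tweet carries none.  First-match is
    our simplifying assumption; the paper states no tie-breaking rule."""
    for emoticon, category in EMOTICON_CLASSES.items():
        if emoticon in tweet:
            return category
    return None

# Tweets that receive a label form the initial labeled corpus T.
stream = ["今天好开心[哈哈]", "真让人生气[怒]", "没有表情的微博"]
T = [(t, label_by_emoticon(t)) for t in stream if label_by_emoticon(t)]
```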
For each tweet $t$ in $T$, MoodLens converts it into a sequence of words $\{w_i\}$, where $w_i$ is a word and $i$ is its position in $t$. In MoodLens, we employ the simple Naïve Bayes (NB) method to build the classifier, which consumes little training time and predicts the category quickly. From the labeled tweets, we obtain the prior probability that word $w_i$ belongs to sentiment category $c_j$ as
$$P(w_i \mid c_j) = \frac{n^{c_j}(w_i) + 1}{\sum_q \left( n^{c_j}(w_q) + 1 \right)},$$
where $j = 1, 2, 3$ or $4$, $n^{c_j}(w_i)$ is the number of times $w_i$ appears in all tweets of category $c_j$, and Laplace smoothing is used to avoid the zero-probability problem. We can then establish the Naïve Bayes classifier as follows: for an unlabeled tweet $t$ with word sequence $\{w_i\}$, its category is obtained as
$$c^*(t) = \arg\max_j P(c_j) \prod_i P(w_i \mid c_j),$$
where $P(c_j)$ is the prior probability of $c_j$.

To validate the performance of the classifier, the set of labeled tweets is randomly divided into a training set, denoted $T_{train}$, and a testing set, denoted $T_{test}$. The fraction of training data is $f_t = |T_{train}|/|T|$. In $T_{train}$, the set of tweets labeled $c_j$ is denoted $T^{c_j}_{train}$; similarly, the tweets of $c_j$ in $T_{test}$ are denoted $T^{c_j}_{test}$. In the testing set, the set of correctly predicted tweets of $c_j$ is denoted $P^{c_j}$. From these definitions, we mainly employ three metrics in this paper to describe the effectiveness of the classifier. Precision is defined as $p = \sum_{j=1}^{4} |P^{c_j}| / |T_{test}|$. Recall is defined as $r = \frac{1}{4} \sum_{j=1}^{4} |P^{c_j}| / |T^{c_j}_{test}|$. F-measure is defined as $f = 2pr/(p+r)$.

In this demo, we use a standard bag of words as the features, set $f_t = 0.9$ and $P(c_j) = 0.25$, and obtain a Naïve Bayes classifier whose precision is 64.3%, recall is 53.3%, and F-measure is 58.3%.

We also present a simple incremental learning approach to complement the original Naïve Bayes classifier. Here we assume the tweets in Weibo form a stream in which a fraction of tweets (denoted $u$) is sentimentally labeled; these labeled tweets can then be used to update the prior probabilities of the words. To verify the effectiveness of this method, the following experiment is performed. We randomly shuffle $T$ and divide it into 50 pieces of the same size. We use the first piece as the training set to obtain an initial classifier. The other 49 pieces are treated as the tweet stream: they enter the classifier one by one, and in each piece a fraction $u$ of tweets is randomly selected as labeled tweets and used to update the classifier. As shown in Figure 2, as the piece index, denoted $s$, grows, the $p$, $r$, and $f$ of the classifier indeed grow. In particular, a higher $u$ means that a larger fraction of labeled tweets is used to update the classifier, and these additional updates produce a larger improvement.

Figure 2: Experiments of incremental learning; panels (a) Precision, (b) Recall, and (c) F-measure versus the piece index $s$, with $u$ = 0, 0.01, 0.05, 0.1 from bottom to top, respectively.

In summary, MoodLens employs a Naïve Bayes classifier with incremental learning to predict the detailed sentiment of tweets. Other solutions, such as Liblinear [4], consume much more training time while gaining less than a 5% improvement in precision; moreover, it is also hard to incorporate incremental learning approaches into them.
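As a concrete illustration, the following Python sketch implements the training, Laplace smoothing, and argmax classification equations above. The class and method names are ours, and the use of log-probabilities to avoid floating-point underflow is an implementation choice the paper does not discuss. Because the model is just a table of word counts, the stream update needed for incremental learning reduces to adding the new counts.

```python
import math
from collections import Counter

CATEGORIES = ["angry", "disgusting", "joyful", "sad"]

class NaiveBayesClassifier:
    """Word-count Naive Bayes with Laplace smoothing, per the equations above."""

    def __init__(self, prior=0.25):
        self.prior = prior                                 # P(c_j), fixed at 0.25 in the demo
        self.counts = {c: Counter() for c in CATEGORIES}   # n^{c_j}(w_i)
        self.vocab = set()

    def train(self, labeled_tweets):
        """labeled_tweets: iterable of (word_list, category) pairs."""
        for words, category in labeled_tweets:
            self.counts[category].update(words)
            self.vocab.update(words)

    # Incremental learning: tweets labeled later in the stream simply
    # add to the same counts, so `update` is just another `train` call.
    update = train

    def _log_word_prob(self, word, category):
        # Laplace-smoothed P(w_i | c_j) = (n^{c_j}(w_i)+1) / sum_q (n^{c_j}(w_q)+1),
        # where q ranges over the vocabulary seen so far.
        denom = sum(self.counts[category].values()) + len(self.vocab)
        return math.log(self.counts[category][word] + 1) - math.log(denom)

    def classify(self, words):
        # c*(t) = argmax_j P(c_j) * prod_i P(w_i | c_j), computed in log
        # space to avoid underflow on long tweets.
        return max(
            CATEGORIES,
            key=lambda c: math.log(self.prior)
            + sum(self._log_word_prob(w, c) for w in words),
        )
```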
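The evaluation metrics and the piece-by-piece experiment can be sketched in the same vein, reusing NaiveBayesClassifier and CATEGORIES from the sketch above. The held-out test set is passed in as a parameter, since the paper does not spell out how $p$, $r$, and $f$ were measured at each piece index $s$; all function names here are hypothetical.

```python
import random

def evaluate(clf, test_set):
    """Compute p, r, f exactly as defined above on (word_list, category) pairs."""
    correct = {c: 0 for c in CATEGORIES}   # |P^{c_j}|
    totals = {c: 0 for c in CATEGORIES}    # |T^{c_j}_test|
    for words, category in test_set:
        totals[category] += 1
        if clf.classify(words) == category:
            correct[category] += 1
    p = sum(correct.values()) / len(test_set)                        # precision
    r = sum(correct[c] / max(totals[c], 1) for c in CATEGORIES) / 4  # recall
    f = 2 * p * r / (p + r) if p + r else 0.0                        # F-measure
    return p, r, f

def incremental_experiment(tweets, test_set, u=0.05, pieces=50):
    """Replay of the streaming experiment on a shuffled labeled corpus."""
    random.shuffle(tweets)
    size = len(tweets) // pieces
    chunks = [tweets[i * size:(i + 1) * size] for i in range(pieces)]

    clf = NaiveBayesClassifier()
    clf.train(chunks[0])                      # first piece: initial classifier
    scores = [evaluate(clf, test_set)]
    for chunk in chunks[1:]:                  # remaining 49 pieces as a stream
        labeled = random.sample(chunk, int(u * len(chunk)))
        clf.update(labeled)                   # fraction u refreshes the counts
        scores.append(evaluate(clf, test_set))
    return scores                             # (p, r, f) per piece index s
```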
3. APPLICATIONS

Data Collection. Weibo has published its APIs since 2010, and through these APIs it is easy to obtain public tweets and some basic demographic attributes of the users. We build a Weibo application named "Are you happy?!" and