Expert Systems with Applications 38 (2011) 5330–5335
Contents lists available at ScienceDirect
Expert Systems with Applications
journal homepage: www.elsevier.com/locate/eswa
A tag-topic model for blog mining

Flora S. Tsai *
School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore 639798, Singapore

Article info
Keywords: Blog mining, Weblog, Tags, Author-Topic model, Isomap, Latent Dirichlet Allocation

Abstract
Blog mining addresses the problem of mining information from blog data. Although mining blogs may share many similarities to Web and text documents, existing techniques need to be reevaluated and adapted for the multidimensional representation of blog data, which exhibit dimensions not present in traditional documents, such as tags. Blog tags are semantic annotations in blogs which can be valuable sources of additional labels for the myriad of blog documents. In this paper, we present a tag-topic model for blog mining, which is based on the Author-Topic model and Latent Dirichlet Allocation. The tag-topic model determines the most likely tags and words for a given topic in a collection of blog posts. The model has been successfully implemented and evaluated on real-world blog data.
© 2010 Elsevier Ltd. All rights reserved.

1. Introduction

A blog, or weblog, is a type of online journal where entries are made in reverse chronological order. Blogs can comment on a particular subject, as well as form a social network (Tsai, Han, Xu, & Chua, 2009). The blogosphere is defined as the collection of all blogs as a community or social network. Because of the large number of existing blog documents (posts), the blogosphere content may be random and chaotic (Chen, Tsai, & Chan, 2008). As a result, effective mining and visualization techniques are needed to aid in the analysis and understanding of blog data.

A tag is a keyword that can be used to describe a blog. The tag metadata is useful for users to quickly find related blog entries that are tagged to a topic of interest. Tags can be chosen by the blogger, the viewer, or both. If many users tag many items, this tag collection forms a folksonomy.
Tagging was popularized by Web 2.0 and is an important feature of many existing services.

Many blog systems allow bloggers to add new tags to a post, in addition to placing the post into categories. For example, a post may display that it has been tagged with "web" and "security". Each of those tags can link to a main page that lists all of the related posts with the same tag. A sidebar may list all the tags for that blog, with each tag leading to an index page. If a post is incorrectly classified, a blogger can edit the list of tags.

Analysis of large collections of data with multiple tags may require the use of dimensionality reduction or projection techniques to transform the data into a smaller set. Dimensionality reduction finds a smaller set of features that can describe the original set of observed dimensions. It can uncover hidden structure that is useful for understanding and visualizing the data.

Previous studies (Chen, Tsai, & Chan, 2007; Liang, Tsai, & Kwee, 2009; Tsai & Chan, 2007a) use existing data mining techniques without considering the additional dimensions present in blogs. In this paper, we show how blog mining is different from traditional Web and text mining by defining the multiple dimensions in blog documents and comparing them to Web and text documents. Next, we describe a tag-topic model for mining the multiple tags present in blogs. Finally, we apply the Isomap (Tenenbaum, de Silva, & Langford, 2000) dimensionality reduction technique for visualizing real-world collections of security blogs.

The paper is organized as follows: Section 2 describes past work in blog content and tag mining. Section 3 presents the models and techniques for blog mining, including the proposed tag-topic model to analyze and visualize the multiple tags present in blog data. Section 4 presents experimental results on real-world blog data, and Section 5 concludes the paper.

2. Blog content and tag mining

2.1. Dimensions of blog documents

A blog is structured differently from a typical Web or text document. Table 1 compares the different components of blog, Web, and text documents. URL stands for Uniform Resource Locator, the Web address from which a document can be found. A permalink is specific to blogs, and is a URL that points to a specific blog entry after the entry has passed from the front page into the blog archives. Outlinks are documents that are linked from the blog or Web document. Tags are labels that people use to make it easier to find related blog posts, photos, and videos.

* Tel.: +65 6790 6369; fax: +65 6793 3318. E-mail address: fst1@columbia.edu
0957-4174/$ - see front matter © 2010 Elsevier Ltd. All rights reserved. doi:10.1016/j.eswa.2010.10.025
If we consider the different components of blogs, we can group general blog data mining into five main dimensions (blog content, tags, authors, links, and time), shown in Table 2. The next sections define and summarize blog content and tag mining techniques.

2.2. Blog content mining

Blog content consists of the title and content of the blog documents. Many of the techniques are similar to those used for text and Web documents; however, important distinctions that pose challenges in natural language processing include the common use of abbreviations and slang words, spelling and grammatical errors, and different languages present within one document.

Many blog content mining techniques focus on sentiment or opinion mining, that is, judging whether a particular blog post is negative, positive, or neutral toward a particular entity (such as a person or product). In fact, one of the main tasks in the Text Retrieval Conference (TREC) Blog Track was the Opinion Retrieval Task, which involved locating blog posts that express an opinion about a given target (Ounis, de Rijke, Macdonald, Mishne, & Soboroff, 2006; Ounis, Macdonald, & Soboroff, 2008; Macdonald, Ounis, & Soboroff, 2007).

Another prevalent theme in blog content mining is the filtering of spam blogs, or splogs, which can greatly distort any estimate of the number of blogs posted. Previous work in splog detection includes self-similarity analysis on blog temporal dynamics (Lin, Sundaram, Chi, Tatemura, & Tseng, 2007) and the use of Support Vector Machines (SVMs) to identify splogs (Kolari, Finin, & Joshi, 2006).

Yet another important task in blog content mining is topic distillation, which was the second main task in TREC Blog 2007 (Macdonald et al., 2007) and 2008 (Ounis et al., 2008). The blog distillation, or feed search, task focuses on blog feeds, which are aggregates of blog posts. The blog distillation task searches for a blog feed with a principal, recurring interest in topic t.
For a given topic t, systems should suggest feeds that are principally devoted to t over the timespan of the feed, and that would be recommended for subscription as an interesting feed about t (Macdonald et al., 2007). This task has direct relevance to the problem of searching for blogs to which a user may wish to subscribe. As many blog posts are inherently noisy, finding the relevant feeds is not a trivial problem.

2.3. Blog tag mining

A blog tag is a word that categorizes documents according to their topic. Blog tag mining is a subset of social media tag mining. Social media sites, such as Flickr, MySpace, and del.icio.us, allow users to semantically annotate many different types of content. These user-generated tags classify content so that it can be easily found. Because blog tags are typically user-generated, different users may use different tags to describe a similar blog. There is also a lack of information about the meaning of each tag. For example, the tag "apple" could refer to either the fruit or the company. This personalized variety of tags makes it difficult to find comprehensive information about a subject. Our proposed model attempts to solve some of the difficulties of blog tag mining by applying probabilistic and dimensionality reduction techniques, which can reduce the noise in blog tags.

3. Models and techniques for blog mining

In this section, we propose and apply probabilistic models and dimensionality reduction techniques for analyzing and visualizing the multiple tags present in blog data. This model can easily be extended to different categories of multidimensional data, such as other types of social media. The techniques are based on Latent Dirichlet Allocation (Blei, Ng, & Jordan, 2003), a modified version of the Author-Topic model, and the Isomap dimensionality reduction algorithm.

3.1. Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) (Blei et al., 2003) models text documents as mixtures of latent topics, which are key concepts presented in the text.
LDA is not as vulnerable to overfitting as traditional methods based on Latent Semantic Analysis (LSA) (Chen et al., 2008; Deerwester, Dumais, Furnas, Landauer, & Harshman, 1990). The topic mixture is drawn from a conjugate Dirichlet prior that is the same for all documents. The steps adapted for blog documents are summarized below:

(1) Select a multinomial distribution $\phi_t$ for each topic $t$ from a Dirichlet distribution with parameter $\beta$.
(2) For each blog document $b$, select a multinomial distribution $\theta_b$ from a Dirichlet distribution with parameter $\alpha$.
(3) For each word token $w$ in blog $b$, select a topic $t$ from $\theta_b$.
(4) Select a word $w$ from $\phi_t$.

The probability of generating a corpus is:

$$\int\!\!\int \prod_{t=1}^{K} P(\phi_t \mid \beta) \prod_{b=1}^{N} P(\theta_b \mid \alpha) \left( \prod_{i=1}^{N_b} \sum_{t_i=1}^{K} P(t_i \mid \theta)\, P(w_i \mid t, \phi) \right) d\theta \, d\phi \qquad (1)$$

3.2. Topic-tag model

An extension of LDA to probabilistic Author-Topic (AT) modeling (Rosen-Zvi, Griffiths, Steyvers, & Smyth, 2004; Steyvers, Smyth, Rosen-Zvi, & Griffiths, 2004) is proposed for blog tag and topic visualization. The AT model is based on Gibbs sampling, a Markov chain Monte Carlo technique, where each author is represented by a probability distribution over topics, and each topic is represented as a probability distribution over terms (words) for that topic (Steyvers et al., 2004).

Table 1
Comparison of blog, Web, and text documents.

Components   Blog   Web   Text
Title        ✓      ✓
Content      ✓      ✓     ✓
Tags         ✓
Author       ✓
URL          ✓      ✓
Permalink    ✓
Outlinks     ✓      ✓
Time         ✓
Date         ✓

Table 2
Blog dimensions.

Dimensions   Blog components
Content      Title and content
Tags         Tags (labels or categories)
Author       Author or blogger
Links        URL, permalink, outlinks
Time         Date and time
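As a rough illustration of the four LDA generative steps summarized in Section 3.1, the following is a minimal sketch in plain Python. It is not the implementation used in the paper: the function names (`sample_dirichlet`, `sample_index`, `generate_corpus`), the fixed number of words per blog, and the toy parameter values in the usage note are assumptions made only for this example, and words are represented by integer vocabulary indices.

```python
import random


def sample_dirichlet(concentration, size, rng):
    """Draw one sample from a symmetric Dirichlet via normalized Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def sample_index(probs, rng):
    """Draw an index from a discrete (multinomial) distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_corpus(num_blogs, words_per_blog, num_topics, vocab_size,
                    alpha, beta, seed=0):
    """Generate a toy corpus following the four LDA steps (hypothetical helper)."""
    rng = random.Random(seed)
    # Step 1: one word distribution phi_t per topic, drawn from Dirichlet(beta).
    phi = [sample_dirichlet(beta, vocab_size, rng) for _ in range(num_topics)]
    corpus = []
    for _ in range(num_blogs):
        # Step 2: one topic distribution theta_b per blog, drawn from Dirichlet(alpha).
        theta_b = sample_dirichlet(alpha, num_topics, rng)
        words = []
        for _ in range(words_per_blog):
            topic = sample_index(theta_b, rng)             # Step 3: topic from theta_b
            words.append(sample_index(phi[topic], rng))    # Step 4: word from phi_t
        corpus.append(words)
    return phi, corpus
```

For instance, `generate_corpus(3, 20, 4, 6, alpha=0.5, beta=0.1)` would produce three toy blogs of 20 word indices each. Fitting the model to real data runs this process in reverse, which is where the Gibbs sampling described in this section comes in.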
We have extended the AT model for the analysis of blog tags. For the tag-topic (TT) model, each tag is represented by a probability distribution over topics, and each topic is represented by a probability distribution over terms for that topic. Fig. 1 shows the generative model of the TT model using plate notation.

For the TT model, the probability of generating a blog is given by:

$$\prod_{i=1}^{N_b} \frac{1}{T_b} \sum_{l} \sum_{t=1}^{K} \phi_{w_i t}\, \theta_{t l} \qquad (2)$$

where blog $b$ has $T_b$ tags. The probability is then integrated over $\phi$ and $\theta$ and their Dirichlet distributions and sampled using the Gibbs sampling Monte Carlo technique.

The similarity matrices for tags and content can then be calculated using the symmetrized Kullback-Leibler (KL) distance between topic distributions, which measures the difference between two probability distributions. The similarity matrices can be visualized using the Isomap dimensionality reduction technique described in the following section.

3.3. Isometric feature mapping (Isomap)

Isomap (Tenenbaum et al., 2000) is a nonlinear dimensionality reduction technique that uses multidimensional scaling (MDS) (Davison, 2000) techniques with geodesic interpoint distances instead of Euclidean distances. Geodesic distances represent the shortest paths along the curved surface of the manifold. Unlike the linear techniques, Isomap can discover the nonlinear degrees of freedom that underlie complex natural observations (Tenenbaum et al., 2000).

Isomap deals with finite data sets of points in $\mathbb{R}^n$ which are assumed to lie on a smooth submanifold $M^d$ of low dimension $d < n$. The algorithm attempts to recover $M$ given only the data points. Isomap estimates the unknown geodesic distance in $M$ between data points in terms of the graph distance with respect to some graph $G$ constructed on the data points. The Isomap algorithm consists of three basic steps:

(1) Find the nearest neighbors on the manifold $M$, based on the distances between pairs of points in the input space.
(2) Approximate the geodesic distances between all pairs of points on the manifold $M$ by computing their shortest path distances in the graph $G$.
(3) Apply MDS to the matrix of graph distances, constructing an embedding of the data in a $d$-dimensional Euclidean space $Y$ that best preserves the manifold's estimated intrinsic geometry (Tenenbaum et al., 2000).

If two points appear on a nonlinear manifold, their Euclidean distance in the high-dimensional input space may not accurately reflect their intrinsic similarity. The geodesic distance along the low-dimensional manifold is thus a better representation for these points. The neighborhood graph $G$ constructed in the first step allows an estimate of the true geodesic path to be computed efficiently in step two, as the shortest path in $G$. The two-dimensional embedding recovered by Isomap in step three best preserves the shortest path distances in the neighborhood graph. The embedding now represents simpler and cleaner approximations to the true geodesic paths than do the corresponding graph paths (Tenenbaum et al., 2000).

Isomap is a useful noniterative, polynomial-time algorithm for nonlinear dimensionality reduction. Isomap is able to compute a globally optimal solution and, for a certain class of data manifolds (such as the Swiss roll), is guaranteed to converge asymptotically to the true structure (Tenenbaum et al., 2000). However, Isomap may not easily handle more complex domains, such as those with non-trivial curvature or topology. Because a previous study showed that Isomap was generally able to perform well on visualization of synthetic as well as real-world data (Tsai & Chan, 2007b), we have applied Isomap for visualizing blog content and tags.

4. Experiments and results

We used the tag-topic model for blog data mining on our collection of real-world blog data. Dimensionality reduction was performed with Isomap to show the similarity plot of blog content and tags.
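To make the Isomap step used for these similarity plots more concrete, the three steps of Section 3.3 can be sketched in plain Python as follows. This is an illustrative sketch under several assumptions, not the implementation used in the experiments: the input is a symmetric pairwise distance matrix, the neighborhood size `k` is a hypothetical parameter, the neighborhood graph is assumed to be connected, and a simple power iteration with deflation stands in for a proper eigenvalue solver in the MDS step.

```python
import math
import random


def knn_graph(dist, k):
    """Step 1: connect each point to its k nearest neighbours (symmetric edges)."""
    n = len(dist)
    graph = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        graph[i][i] = 0.0
        nearest = sorted((j for j in range(n) if j != i),
                         key=lambda j: dist[i][j])[:k]
        for j in nearest:
            graph[i][j] = graph[j][i] = dist[i][j]
    return graph


def shortest_paths(graph):
    """Step 2: Floyd-Warshall shortest paths as geodesic distance estimates."""
    n = len(graph)
    geo = [row[:] for row in graph]
    for m in range(n):
        for i in range(n):
            for j in range(n):
                if geo[i][m] + geo[m][j] < geo[i][j]:
                    geo[i][j] = geo[i][m] + geo[m][j]
    return geo


def classical_mds(geo, dims=2, iters=500, seed=0):
    """Step 3: classical MDS via double centering and power iteration."""
    n = len(geo)
    sq = [[geo[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]
    total_mean = sum(row_mean) / n
    # Double-centered Gram matrix B = -1/2 * J * D^2 * J (D^2 is symmetric).
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]
    rng = random.Random(seed)
    coords = [[0.0] * dims for _ in range(n)]
    for d in range(dims):
        v = [rng.random() - 0.5 for _ in range(n)]
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w)) or 1.0
            v = [x / norm for x in w]
        eigval = sum(v[i] * sum(b[i][j] * v[j] for j in range(n))
                     for i in range(n))
        scale = math.sqrt(max(eigval, 0.0))
        for i in range(n):
            coords[i][d] = scale * v[i]
        # Deflate B so the next pass finds the next leading eigenvector.
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigval * v[i] * v[j]
    return coords


def isomap(dist, k=2, dims=2):
    """Chain the three steps; assumes the k-nearest-neighbour graph is connected."""
    return classical_mds(shortest_paths(knn_graph(dist, k)), dims)
```

Given a matrix of pairwise distances between tags or documents, `isomap(dist, k=2, dims=2)` returns one two-dimensional coordinate pair per item, which could then be drawn as a scatter plot. The cubic-time shortest-path step is acceptable for a few hundred items; larger collections would call for a sparse graph and Dijkstra's algorithm.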
Experiments show that the tag-topic model can reveal interesting patterns in the underlying tags and topics for our dataset of security-related blogs. 4.1. Data corpus For our experiments, we extracted a subset of the Nielson BuzzMetrics blog data corpus1 that focuses on blogs related to security threats and incidents related to cyber crime and computer viruses. The original dataset consists of 14 million blog posts collected by Nielsen BuzzMetrics for May 2006. Although the blog entries span only a short period of time, they are indicative of the amount and variety of blog posts that exists in different languages throughout the world. Blog entries related to security threats such as malware, cyber crime, computer virus, encryption, and information security were extracted by keyword search and stored for use in our analysis. There were a total of 3096 entries in our dataset; however, as most of the blog posts do not have tags associated with them, we eliminated those documents with null or blank tags, as well as those with tags labeled as ‘‘uncategorized’’. Each of the remaining 948 blog entries was saved as a text file for further text preprocessing. For the preprocessing of the blog content, HTML tags were removed, lexical analysis was performed by removing stopwords, stemming, and pruning by the Text to Matrix Generator (TMG) (Zeimpekis & Gallopoulos, 2006) prior to generating the term-document matrix using term frequency (TF) local term weighting. The total number of terms after pruning and stopword removal was 4111. For the tag-document matrix, tags separated by ‘‘and’’, ‘‘/’’, or ‘‘& ’’ were treated as separate tags. Otherwise, the words were Fig. 1. The graphical model for the tag-topic model using plate notation. 1 http://www.icwsm.org/data.html. 5332 F.S. Tsai / Expert Systems with Applications 38 (2011) 5330–5335
combined to form one tag. The tag-document matrix was generated with binary local term weighting, resulting in a total of 552 unique tags. The term-document matrix and tag-document matrix were used to compute the tag-topic model.

In this model, each tag is represented by a probability distribution over topics, and each topic is represented as a probability distribution over terms for that topic (Steyvers et al., 2004). The topic-term and tag-topic distributions were then learned from the blog data in an unsupervised manner. The parameters used in our experiments were the number of topics (t = 50) and the number of iterations (N = 2000). We used symmetric Dirichlet priors in the tag-topic (TT) estimation with α = 50/t and β = 0.01, which are common settings in the literature. The most likely terms and corresponding tags from each topic of the blog entry collection are listed in Tables 3–6.

From the results, we observe that some of the blog tags may not be very descriptive of the topic. For example, for the topic Spyware (Table 5), the tags "quizzes", "thankyouforsmoking", "aquifer", and "catchingupwithtowanda" do not seem especially relevant to the topic. Since tags are user generated, there is often a problem of mislabeling, or of using long phrases instead of one or two words to tag a blog. Bloggers also have a tendency to use the same tag for many or all of their posts, no matter what the subject.

4.2. Blog content visualization

For visualizing the document similarities, the symmetrized Kullback-Leibler distance between topic distributions was calculated for each document pair. Fig. 2 shows the 2D plot of the document similarities based on the document-topic distributions. A random sample of 100 titles was taken and is shown in the plot.

4.3. Blog tag visualization

For visualizing the tag similarities, the symmetrized Kullback-Leibler distance between topic distributions was calculated for each tag pair. Fig.
3 shows the 2D plot of the tag similarities based on the tag-topic distributions of the most popular tags.

Table 3
Topic 11: malware.

Term        Probability   Tag                     Probability
browser     0.07184       world                   0.13636
worm        0.04667       web                     0.09365
yahoo       0.03283       videogames              0.07790
user        0.03121       links                   0.05805
safeti      0.02768       www                     0.05079
instal      0.02488       news                    0.05011
facetim     0.02355       opinion                 0.03409
hijack      0.02002       internet                0.03245
malwar      0.01870       windows                 0.02834
site        0.01708       economy                 0.02369

Table 4
Topic 22: Windows security.

Term        Probability   Tag                     Probability
threat      0.02759       diggnews                0.47986
secure      0.02566       miscellanea             0.03511
custom      0.02227       gallery                 0.02606
window      0.02203       world                   0.02111
antivirus   0.02178       musique                 0.01960
beta        0.01985       spywarenews             0.01637
protect     0.01960       blogging                0.01271
response    0.01839       warroom                 0.01228
vista       0.01839       photos                  0.00862
offer       0.01476       mobilesociety           0.00797

Table 5
Topic 26: Spyware.

Term        Probability   Tag                     Probability
spyware     0.10403       spywarenews             0.52080
comput      0.02331       quizzes                 0.04806
software    0.02177       thankyouforsmoking      0.04412
anti        0.01868       aquifer                 0.03719
yahoo       0.01800       catchingupwithtowanda   0.01765
web         0.01594       writing                 0.01623
user        0.01525       spywarebooks            0.01529
system      0.01320       secularhumanism         0.00961
new         0.01234       sport                   0.00804
person      0.01183       warroom                 0.00756

Table 6
Topic 48: Identity theft.

Term        Probability   Tag                     Probability
secure      0.04668       photos                  0.31245
card        0.02941       security                0.04562
theft       0.02462       religion                0.03325
access      0.02334       miscellanea             0.03243
credit      0.02302       vehicles                0.02556
compani     0.01982       review                  0.01539
ident       0.01695       veggingout              0.01182
execute     0.01567       wespen                  0.01154
laptop      0.01567       intellisense            0.01127
employe     0.01503       writing                 0.01127

In the plot,
each tag was scaled according to the number of blogs posted using that tag. The distances between the tags reflect the dissimilarity between tags, based on the topic distributions of the blogs that were posted. As seen from the plot, the majority of blogs in our dataset were tagged with either "spywarenews" or "news". Because of the free-form nature of the tags, problems may arise due to nonstandardized tag labels. This problem may be solved when a larger set of blogs is taken. In addition, some of the tags overlap because they are tagged to the same or similar topics. This may be due to the specialized nature of our dataset, which focused on security blogs. If a larger set of blogs is taken, there may not be as many overlapping tags.

5. Conclusion and future work

In this paper, we proposed a tag-topic model for blog mining based on the Author-Topic model. In this model, each tag is represented by a probability distribution over topics, and each topic is represented as a probability distribution over terms for that topic. This can solve the problem of finding the most likely tags and terms for a given topic.

We have successfully implemented and evaluated the tag-topic model on real-world security blogs. Using the output of the tag-topic model, we present results in visualizing which tags are similar to each other with the Isomap dimensionality reduction technique. In addition, we plot the results of the blog document similarities, based on the same techniques. Since the tags are user generated, there may be some inherent noise in the tags. Dimensionality reduction can help remove the noise in the tags, and may prove useful for future studies focusing on tag mining and visualization. The tag-topic model can be extended in the future for larger datasets as well as other types of social media with semantic annotations.

References

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.
Chen, Y., Tsai, F. S., & Chan, K. L. (2007). Blog search and mining in the business domain. In DDDM ’07: Proceedings of the 2007 international workshop on domain driven data mining (pp. 55–60). New York, NY, USA: ACM.
Chen, Y., Tsai, F. S., & Chan, K. L. (2008). Machine learning techniques for business blog search and mining. Expert Systems with Applications, 35(3), 581–590.
Davison, M. (2000). Multidimensional scaling. Florida: Krieger Publishing Company.
Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391–407.
Kolari, P., Finin, T., & Joshi, A. (2006). SVMs for the blogosphere: Blog identification and splog detection. In AAAI spring symposium on computational approaches to analysing Weblogs.
Liang, H., Tsai, F. S., & Kwee, A. T. (2009). Detecting novel business blogs. In ICICS 2009: Conference proceedings of the 7th international conference on information, communications and signal processing (ICICS).
Lin, Y.-R., Sundaram, H., Chi, Y., Tatemura, J., & Tseng, B. L. (2007). Splog detection using self-similarity analysis on blog temporal dynamics. In AIRWeb ’07: Proceedings of the third international workshop on adversarial information retrieval on the web (pp. 1–8). New York, NY, USA: ACM.
Macdonald, C., Ounis, I., & Soboroff, I. (2007). Overview of the TREC-2007 blog track. In The sixteenth text REtrieval conference (TREC 2007) proceedings.
Ounis, I., de Rijke, M., Macdonald, C., Mishne, G. A., & Soboroff, I. (2006). Overview of the TREC-2006 Blog track. In TREC 2006 working notes (pp. 15–27).
Ounis, I., Macdonald, C., & Soboroff, I. (2008). Overview of the TREC-2008 Blog track. In TREC 2008 working notes.
Rosen-Zvi, M., Griffiths, T., Steyvers, M., & Smyth, P. (2004). The author-topic model for authors and documents. In AUAI ’04: Proceedings of the 20th conference on uncertainty in artificial intelligence (pp. 487–494).
Arlington, Virginia, United States: AUAI Press.
Steyvers, M., Smyth, P., Rosen-Zvi, M., & Griffiths, T. (2004). Probabilistic author-topic models for information discovery. In KDD ’04: Proceedings of the tenth ACM SIGKDD international conference on knowledge discovery and data mining (pp. 306–315). New York, NY, USA: ACM.
Tenenbaum, J., de Silva, V., & Langford, J. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500), 2319–2323.

Fig. 2. Results on visualization of blog content using Isomap (k = 100). [Figure: 2D plot of the sampled blog post titles; point labels omitted.]

Fig. 3. Results on visualization of blog tags using Isomap (k = 20). [Figure: 2D plot of the most popular tags, including diggnews, news, security, and spywarenews; remaining point labels omitted.]
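The distance computation behind Figs. 2 and 3 can be illustrated with a minimal sketch. The text states only that a symmetrized Kullback-Leibler distance between topic distributions was computed for each pair; the choice of averaging the two directed divergences, the small smoothing constant `eps`, and all function names below are assumptions made for illustration, not details taken from the paper.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """Directed divergence KL(p || q) for two equal-length discrete distributions.

    eps is an assumed smoothing constant that guards against zero entries in q.
    """
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def symmetrized_kl(p, q):
    """Average of KL(p || q) and KL(q || p); the averaging is an assumed convention."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


def distance_matrix(topic_distributions):
    """Pairwise symmetrized KL distances between rows (one row per document or tag)."""
    n = len(topic_distributions)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = symmetrized_kl(topic_distributions[i], topic_distributions[j])
            dist[i][j] = d
            dist[j][i] = d
    return dist
```

A matrix of this form could then be passed to any Isomap implementation that accepts precomputed pairwise distances in order to obtain a 2D embedding.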
Tsai, F. S., & Chan, K. L. (2007a). Detecting cyber security threats in weblogs using probabilistic models. In Lecture Notes in Computer Science (LNCS) (Vol. 4430, pp. 46–57).
Tsai, F. S., & Chan, K. L. (2007b). Dimensionality reduction techniques for data exploration. In 2007 6th international conference on information, communications and signal processing, ICICS.
Tsai, F. S., Han, W., Xu, J., & Chua, H. C. (2009). Design and development of a mobile peer-to-peer social networking application. Expert Systems with Applications, 36(8), 11077–11087.
Zeimpekis, D., & Gallopoulos, E. (2006). TMG: A MATLAB Toolbox for generating term-document matrices from text collections. In Grouping multidimensional data (pp. 187–210). Cambridge, MA: MIT Press.
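The tag handling described in Section 4.1 (splitting on "and", "/", or "&", merging the remaining words into a single tag, discarding blank and "uncategorized" tags, and applying binary weighting) can be sketched as follows. This is an illustrative sketch, not the code used in the experiments: the function names are hypothetical, and the lowercasing step is an assumption suggested by the lowercase tags in Tables 3–6.

```python
import re

# Separators named in Section 4.1: the word "and", "/" and "&".
_SEPARATORS = re.compile(r"\s+and\s+|/|&", flags=re.IGNORECASE)


def split_tags(raw):
    """Split a raw tag string into tags, merging the words of each part into one tag."""
    tags = []
    for part in _SEPARATORS.split(raw):
        # Assumption: lowercase the tag and drop internal whitespace.
        tag = "".join(part.lower().split())
        if tag and tag != "uncategorized":
            tags.append(tag)
    return tags


def binary_tag_document_matrix(raw_tags_per_post):
    """Return (vocabulary, matrix), where matrix[i][j] is 1 if tag i labels post j."""
    post_tags = [set(split_tags(raw)) for raw in raw_tags_per_post]
    vocabulary = sorted(set().union(*post_tags))
    index = {tag: i for i, tag in enumerate(vocabulary)}
    matrix = [[0] * len(post_tags) for _ in vocabulary]
    for j, tags in enumerate(post_tags):
        for tag in tags:
            matrix[index[tag]][j] = 1
    return vocabulary, matrix
```

In this sketch a post without usable tags simply yields an all-zero column; in the experiments such posts were removed from the corpus before the matrix was built.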