(Hierarchical) Topic Modeling
Text Mining & NLP & ML
Yueshen Xu (lecturer)
ysxu@xidian.edu.cn / xuyueshen@163.com
Data and Knowledge Engineering Research Center
Xidian University
Outline
▣ Background
▣ Some Concepts
▣ Topic Modeling
  ■ Probabilistic Latent Semantic Indexing (PLSI)
  ■ Latent Dirichlet Allocation (LDA)
▣ Hierarchical Topic Modeling
  ■ Chinese Restaurant Process (CRP)
▣ What I do
▣ Supplement & Reference
Scope: basics, not state-of-the-art
Keywords: topic modeling, hierarchical topic modeling, probabilistic graphical model, Bayesian model
2016/12/29, Software Engineering
Background
▣ Information Overload
  ■ Big Data, Cloud Computing, Artificial Intelligence, Deep Learning, etc.
▣ We need
  ■ Summarization
  ■ Visualization
  ■ Dimensionality Reduction
Background
▣ Text Summarization
  ■ Document Summarization
    - What do these docs (or this doc) talk about?
  ■ Review Summarization
    - What do these consumers care about or complain about?
  ■ Short Text/Tweets Summarization
    - What are people discussing?
▣ Basic Requirement
  ■ Automatic
  ■ Applicable
  ■ Explainable
▣ Topic Modeling
Some Concepts
▣ General Concepts
  ■ Latent Semantic Analysis (LSA)
  ■ Text Mining
  ■ Natural Language Processing
  ■ Computational Linguistics
  ■ Information Retrieval
  ■ Dimension Reduction
  ■ Topic Modeling
    - to learn the latent topics from a corpus/document
[Figure: diagram relating Information Retrieval, Computational Linguistics, Natural Language Processing, LSA/Topic Model, Text Mining, Data Mining, Dimension Reduction, Machine Learning, and Machine Translation]
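Topic Modeling
▣ Topic modeling: an example in Chinese (from my doctoral thesis); documents translated into English
Corpus
  Doc 1: Continue to implement a prudent monetary policy, keep it appropriately balanced between tight and loose with timely anticipatory adjustments and fine-tuning, work well with the supply-side structure, and make combined use of quantity-based, price-based, and other monetary policies.
  Doc 2: In terms of personnel numbers, this reform goes far beyond the scale of the troop reduction; it is a structural reform and a key step in modernizing the military's organizational structure.
  Doc 3: The US dollar's status as the main international currency will remain irreplaceable for the foreseeable future; the only way forward is to push global governance in a more balanced direction. In a recent speech at the University of Maryland in the United States, International Monetary Fund Managing Director Lagarde called for international governance reform to recognize the reality that emerging economies are becoming increasingly important.
  Doc 4: After being "weaned" from their parent universities, independent colleges may face growing pains in areas such as branding and enrollment. However, after national, provincial, and municipal implementation guidelines encouraging private capital to enter education were issued, some independent colleges decisively cut the "umbilical cord" to their parent universities and set out to develop on their own.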
Topic Modeling
▣ After topic modeling
Corpus: Doc 1 to Doc 4 from the example corpus (monetary policy, military reform, international currency and global governance, independent colleges)
Learned topics (word, probability):
  Topic 1: finance (金融) 0.074, currency (货币) 0.051, …
  Topic 2: policy (政策) 0.082, reform (改革) 0.063, …
  Topic 3: college (学院) 0.077, education (教育) 0.071, …
  Topic 4: military (军队) 0.083, organization (组织) 0.079, …
[Figure: the four documents with words highlighted according to the topic they belong to]
Topic Modeling
▣ A topic
  ■ A word cluster: a group of words
  ■ Not clustered randomly, but meaningfully (not semantically)
[Figure: example word clusters built from words such as auto, car, engine, emissions, bonnet, hood, tyres, lorry, boot, trunk, make, model, hidden, Markov, normalize]
▣ Models
  ■ Parametric models
    - Latent Semantic Indexing (LSI)
    - PLSI; Latent Dirichlet Allocation (LDA)
  ■ Non-parametric models (Dirichlet Process)
    - (Nested) Chinese Restaurant Process
    - Indian Buffet Process
    - Pitman-Yor Process
Topic Modeling
▣ pLSI Model
[Figure: layered graph with documents d_1, …, d_M, latent topics z_1, …, z_K, and words w_1, …, w_N; the layers are labeled p(d), p(z | d), and p(w | z); annotation: one layer of a 'Deep Neural Network']
▣ Assumption
  ■ Pairs (d, w) are assumed to be generated independently
  ■ Conditioned on z, w is generated independently of d
  ■ Words in a document are exchangeable
  ■ Documents are exchangeable
  ■ Latent topics z are independent
▣ The generative process
  $$p(d, w) = p(w \mid d)\, p(d) = p(d) \sum_{z \in Z} p(w, z \mid d) = p(d) \sum_{z \in Z} p(w \mid z)\, p(z \mid d)$$
  ■ p(w | z) and p(z | d) are multinomial distributions
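The pLSI decomposition p(d, w) = p(d) Σ_z p(w | z) p(z | d) can be checked numerically. The following is a minimal sketch, not part of the original slides: the toy sizes (2 documents, 2 topics, 3 words), the probability values, and the variable names are all assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical toy model: M = 2 documents, K = 2 topics, N = 3 words.
p_d = np.array([0.5, 0.5])                  # p(d), shape (M,)
p_z_given_d = np.array([[0.9, 0.1],         # p(z | d), shape (M, K); each row sums to 1
                        [0.2, 0.8]])
p_w_given_z = np.array([[0.7, 0.2, 0.1],    # p(w | z), shape (K, N); each row sums to 1
                        [0.1, 0.3, 0.6]])

# p(w | d) = sum over z of p(w | z) p(z | d): a matrix product over the topic axis.
p_w_given_d = p_z_given_d @ p_w_given_z     # shape (M, N)

# p(d, w) = p(d) p(w | d): scale each document's row by p(d).
p_dw = p_d[:, None] * p_w_given_d           # shape (M, N)

print(p_dw)
```

Because every factor is a proper distribution, the entries of `p_dw` sum to 1 and each row sums back to the corresponding p(d).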
Topic Modeling
▣ Latent Dirichlet Allocation (LDA)
  ■ David M. Blei, Andrew Y. Ng, Michael I. Jordan
  ■ Hierarchical Bayesian model; Bayesian pLSI
[Figure: plate diagram with nodes α, θ, z, w, and β; plate N repeats over the words of a document and plate M over the documents; the plates are annotated "iterative times"]
▣ Generative process of LDA
  For each document d = {w_1, w_2, …, w_N}:
  1. Choose N ~ Poisson(ξ)
  2. Choose θ ~ Dir(α)
  3. For each of the N words w_n in d:
     a) Choose a topic z_n ~ Multinomial(θ)
     b) Choose a word w_n from p(w_n | z_n, β), a multinomial distribution conditioned on z_n
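The generative process of LDA can be simulated directly. The following is a minimal sketch under stated assumptions, not the lecturer's code: the function name `generate_document`, the toy values of α, β, and ξ, and the use of integer word ids are all hypothetical. It only samples documents from fixed parameters; it does not perform inference.

```python
import numpy as np

def generate_document(alpha, beta, xi, rng):
    """Sample one document from the LDA generative process.

    alpha: (K,) Dirichlet parameter over topics (assumed toy values)
    beta:  (K, V) topic-word probabilities, each row sums to 1
    xi:    mean of the Poisson distribution over document length
    """
    n_words = rng.poisson(xi)                   # N ~ Poisson(xi)
    theta = rng.dirichlet(alpha)                # theta ~ Dir(alpha)
    # z_n ~ Multinomial(theta), one topic per word position
    topics = rng.choice(len(alpha), size=n_words, p=theta)
    # w_n drawn from p(w_n | z_n, beta), i.e. row z_n of beta
    words = np.array(
        [rng.choice(beta.shape[1], p=beta[z]) for z in topics], dtype=int
    )
    return theta, topics, words

rng = np.random.default_rng(0)
alpha = np.array([0.5, 0.5, 0.5])               # K = 3 topics
beta = np.array([[0.50, 0.30, 0.10, 0.05, 0.05],  # V = 5 word ids
                 [0.05, 0.05, 0.10, 0.30, 0.50],
                 [0.20, 0.20, 0.20, 0.20, 0.20]])

# Repeat the process M = 4 times to obtain a small corpus.
corpus = [generate_document(alpha, beta, xi=8, rng=rng) for _ in range(4)]
for theta, topics, words in corpus:
    print(np.round(theta, 2), words)
```

The outer list comprehension plays the role of plate M and the per-word sampling the role of plate N in the plate diagram.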