Incorporating Structured World Knowledge into Unstructured Documents via Heterogeneous Information Networks
Yangqiu Song
The Hong Kong University of Science and Technology (香港科技大學)
Collaborators: Chenguang Wang, Ming Zhang, Yizhou Sun, Jiawei Han, Dan Roth
Slides credit: Chenguang Wang
Outline
• Text Analytics: Motivation
  – Two Challenges
    • Representation
    • Labels
• Text Categorization via HIN
  – HIN construction from texts
  – From HIN similarity to clustering and classification
  – World knowledge indirect supervision
• Conclusions and future work
Text Categorization: Two Challenges
• Impacts many applications!
  ✓ Social network analysis, health care, machine reading …
• Traditional approach: Label data → Train a classifier → Make prediction
• Two challenges:
  ✓ Representation
  ✓ Labels
Representation: Bag-of-words
• Document 1 (label: Mobile Games)
  On Feb. 8, Dong Nguyen announced that he would be removing his hit game Flappy Bird from both the iOS and Android app stores, saying that the success of the game is something he never wanted. Some fans of the game took it personally, replying that they would either kill Nguyen or kill themselves if he followed through with his decision. Frank Lantz, the director of the New York University Game Center, said that Nguyen's meltdown resembles how some actors or musicians behave. "People like that can go a little bonkers after being exposed to this kind of interest and attention," he told ABC News. "Especially when there's a healthy dose of Internet trolls."
  – Bag-of-words: Flappy Bird, iOS, Android, apps, stores, game, musicians
• Document 2 (label: Sports)
  7 February 2014 is going to be a great day in the history of Russia with the upcoming XXII Winter Olympics 2014 in Sochi. As the climate in Russia is subtropical, hence you would love to watch ice capped mountains from the beautiful beaches of Sochi. 2014 Winter Olympics would be an ultimate event for you to share your joys, emotions and the winning moments of your favourite sports champions. If you are really an obsessive fan of Winter Olympics games then you should definitely book your ticket to confirm your presence in winter Olympics 2014 which are going to be held in the provincial town, Sochi. Sochi Organizing committee (SOOC) would be responsible for the organization of this great international multi sport event from 7 to 23 February 2014.
  – Bag-of-words: Russia, Winter Olympics, Sochi, mountains, beaches, sports, champions
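As a minimal sketch of the bag-of-words representation, the snippet below counts content words in shortened versions of the two example documents. The stop-word list, the tokenizer, and the function name `bag_of_words` are assumptions made for illustration; they are not taken from the talk.

```python
import re
from collections import Counter

# Hypothetical, heavily abbreviated stop-word list (illustration only).
STOP_WORDS = {
    "the", "a", "an", "of", "and", "in", "to", "is", "that", "he",
    "his", "from", "both", "would", "be", "with", "on",
}

def bag_of_words(text):
    """Lowercase, keep alphabetic tokens, drop stop words, count the rest."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

# Shortened excerpts of the two example documents.
doc_games = ("Dong Nguyen announced that he would be removing his hit game "
             "Flappy Bird from both the iOS and Android app stores.")
doc_sports = ("7 February 2014 is going to be a great day in the history of "
              "Russia with the upcoming XXII Winter Olympics 2014 in Sochi.")

bow_games = bag_of_words(doc_games)
bow_sports = bag_of_words(doc_sports)

# Content words that appear in both documents.
shared = set(bow_games) & set(bow_sports)
print(sorted(bow_games))
print(sorted(bow_sports))
print(shared)
```

Word order and entity structure are discarded: "Flappy Bird" becomes two unrelated counts, and the two excerpts share no content words at all.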
Context: Topic Models and Word Embeddings
• Topic Modeling (Blei et al., 2003)
[Figure: topics, documents, and topic proportions and assignments, illustrated on the article "Seeking Life's Bare (Genetic) Necessities".]
Figure source: Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55(4), 77-84.
Context: Topic Models and Word Embeddings
• Word embedding
  – Word2vec (Mikolov et al., 2013)
  – GloVe (Pennington et al., 2014)
  – Matrix factorization (Deerwester et al., 1990; Levy et al., 2015)
  – …
[Figure: word2vec architecture (embeddings, projection layer, softmax classifier) on the context "the cat sits on the mat", and embedding-space relations: Male-Female, Verb tense, Country-Capital.]
Figure source: https://www.tensorflow.org/versions/r0.7/tutorials/word2vec/index.html
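The Country-Capital relation is usually illustrated with vector offsets. The sketch below shows the arithmetic on hand-picked 3-dimensional toy vectors. The vectors and the `analogy` helper are hypothetical; real embeddings are learned from large corpora and only approximately satisfy such relations.

```python
import math

# Hypothetical toy vectors, hand-picked so the second coordinate
# encodes "is a capital". Real embeddings are learned, not designed.
vectors = {
    "Germany": [1.0, 0.0, 0.2],
    "Berlin":  [1.0, 1.0, 0.2],
    "Vietnam": [0.0, 0.1, 1.0],
    "Hanoi":   [0.0, 1.1, 1.0],
    "Italy":   [0.5, 0.0, 0.6],
    "Rome":    [0.5, 1.0, 0.6],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c):
    """Answer "a is to b as c is to ?" via the offset b - a + c."""
    query = [vb - va + vc
             for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

print(analogy("Germany", "Berlin", "Vietnam"))
```

The offset captures one relation between individual words. It does not say which entity a word refers to or how entities are related through other entities.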
What's Missing?
• The semantics of entities and their relations
  – Obama: "On Feb. 10, 2007, Obama announced his candidacy for President of the United States in front of the Old State Capitol located in …"
  – Bush: "Bush portrayed himself as a compassionate conservative, implying he was more suitable than other Republicans to go to lead the United States."
• What can context cover?
  – "New York" vs. "New York Times"
• What cannot?
  – "George Washington" vs. "Washington"
  – Higher order relations
    Document --(Contains)-- Basketball --(Affiliation In)-- NBA --(Affiliation In)-- Basketball --(Contains)-- Document
    Document --(Contains)-- Basketball -- Olympics -- Basketball --(Contains)-- Document
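A higher order relation such as Document, Basketball, NBA, Basketball, Document can be read as a typed path through a network of documents and entities. The sketch below counts the path instances connecting two documents along such a path. The toy edges, the `~` prefix for a reversed relation, and the function name are assumptions for illustration; this is not the similarity measure used in the talk.

```python
from collections import defaultdict

# Hypothetical toy network: typed edges (source, relation, target).
edges = [
    ("doc1", "contains", "basketball"),
    ("doc2", "contains", "basketball"),
    ("doc3", "contains", "soccer"),
    ("basketball", "affiliation_in", "NBA"),
    ("basketball", "affiliation_in", "Olympics"),
    ("soccer", "affiliation_in", "Olympics"),
]

# Index neighbors by (node, relation); "~rel" denotes rel traversed backwards.
neighbors = defaultdict(set)
for src, rel, dst in edges:
    neighbors[(src, rel)].add(dst)
    neighbors[(dst, "~" + rel)].add(src)

def count_path_instances(start, end, relations):
    """Count paths from start to end that follow the given relation sequence."""
    counts = {start: 1}
    for rel in relations:
        next_counts = defaultdict(int)
        for node, n_paths in counts.items():
            for nb in neighbors[(node, rel)]:
                next_counts[nb] += n_paths
        counts = next_counts
    return counts.get(end, 0)

# Document -> entity -> organization/event <- entity <- Document
HIGHER_ORDER = ["contains", "affiliation_in", "~affiliation_in", "~contains"]
# Document -> entity <- Document (plain entity co-occurrence)
CO_OCCURRENCE = ["contains", "~contains"]

print(count_path_instances("doc1", "doc3", CO_OCCURRENCE))
print(count_path_instances("doc1", "doc3", HIGHER_ORDER))
```

In this toy network, `doc1` and `doc3` share no entity, so co-occurrence alone finds no link between them. The longer path connects them through the Olympics node.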
Outline
• Text Analytics: Motivation
  – Two Challenges
    • Representation
    • Labels
• Text Categorization via HIN
  – HIN construction from texts
  – From HIN similarity to clustering and classification
  – World knowledge indirect supervision
• Conclusions and future work
Acquire Labeled Data
• Expert annotation
  – Costly
  – Only big companies can hire a lot of experts
• Crowdsourcing (e.g., Amazon Mechanical Turk)
  – Simple tasks
  – Low quality
  – Still costly
• Semi-supervised / transfer learning
  – Domain dependent
  – Many diverse domains
  – Fast changing domains