No. 10    Wu Sen et al.: Parallelization of large-scale text clustering based on MapReduce

large clusters // Proceedings of the 6th Symposium on Operating Systems Design. San Francisco, 2004: 137
[3] Yao Q Y, Liu G S, Li X. VSM-based text clustering algorithm. Comput Eng, 2008, 34(18): 39 (姚清耘，刘功申，李翔．基于向量空间模型的文本聚类算法．计算机工程，2008，34(18): 39)
[4] Zhang X D, Zhou X H, Hu X H. Semantic smoothing for model-based document clustering // Proceedings of the Sixth International Conference on Data Mining. Washington: IEEE Computer Society, 2006: 1193
[5] Bharathi G, Venkatesan D. Study of ontology or thesaurus based document clustering and information retrieval. J Theor Appl Inf Technol, 2012, 40(1): 55
[6] Ma J, Xu W, Sun Y, et al. An ontology-based text-mining method to cluster proposals for research project selection. IEEE Trans Syst Man Cybern Part A, 2012, 42(3): 784
[7] Shi Q W, Zhao Z, Zhao K. Hierarchical clustering of Chinese web pages on suffix tree. J Liaoning Tech Univ, 2006, 25(6): 890 (史庆伟，赵政，朝柯．一种基于后缀树的中文网页层次聚类方法．辽宁工程技术大学学报，2006，25(6): 890)
[8] Aswani Kumar C, Radvansky M, Annapurna J. Analysis of a vector space model, latent semantic indexing and formal concept analysis for information retrieval. Cybern Inf Technol, 2012, 12(1): 34
[9] Wu S H, Cheng Y, Zheng Y N, et al. A survey on text representation and similarity calculation in text clustering. Inf Sci, 2012, 30(4): 622 (吴夙慧，成颖，郑彦宁，等．文本聚类中文本表示和相似度计算研究综述．情报科学，2012，30(4): 622)
[10] Hammouda K M, Kamel M S. Efficient phrase-based document indexing for web document clustering. IEEE Trans Knowl Data Eng, 2004, 16(10): 1279
[11] Logeswari S, Premalatha K. Biomedical document clustering using ontology based concept weight // 2013 International Conference on Computer Communication and Informatics (ICCCI). IEEE, 2013: 1
[12] Zhu K B, Tang J, Yang B R. Web text mining system and clustering analysis algorithm. Comput Eng, 2004, 30(13): 138 (朱克斌，唐菁，杨炳儒．Web 文本挖掘系统及聚类分析算法．计算机工程，2004，30(13): 138)
[13] Dhillon I S, Modha D S. Concept decompositions for large sparse text data using clustering. Mach Learn, 2001, 42(1): 143
[14] Arthur D, Vassilvitskii S. K-means++: the advantages of careful seeding // Proceedings of the 8th Annual ACM-SIAM Symposium on Discrete Algorithms. Philadelphia, 2007: 1027
[15] Zheng W, Ji D, Cai D F, et al. An approach to center selection based on minimal similarity among texts. J Guangxi Norm Univ Nat Sci Ed, 2008, 26(3): 198 (郑伟，季铎，蔡东风，等．基于文本互为最小相似度的中心选取方法．广西师范大学学报: 自然科学版，2008，26(3): 198)
[16] Zhong S. Efficient online spherical K-means clustering // Proceedings of 2005 IEEE International Joint Conference on Neural Networks. IEEE, 2005: 3180
[17] Schölkopf B, Weston J, Eskin E, et al. A kernel approach for learning from almost orthogonal patterns. Lect Notes Comput Sci, 2002, 2431: 494
[18] Ding Y, Fu X. The research of text mining based on self-organizing maps. Procedia Eng, 2012, 29: 537
[19] Ceema I, Kavitha M, Renukadevi G, et al. Clustering web documents using hierarchical method for efficient cluster formation. Int J Sci Eng Technol Res, 2012, 1(5): 127
[20] Gronau I, Moran S. Optimal implementations of UPGMA and other common clustering algorithms. Inf Process Lett, 2007, 104(6): 205
[21] Zhao Y, Karypis G, Fayyad U. Hierarchical clustering algorithms for document datasets. Data Min Knowl Discov, 2005, 10(2): 141
[22] Yin Y, Wei C, Zhang G, et al. Implementation of space optimized bisecting K-means (BKM) based on Hadoop // Proceedings - 9th Web Information Systems and Applications Conference, WISA 2012: 170
[23] Zhao W Z, Ma H F, He Q. Parallel K-means clustering based on MapReduce. Lect Notes Comput Sci, 2009, 5931: 674
[24] Alina E, Sungjin I, Benjamin M. Fast clustering using MapReduce // Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM, 2011: 681
[25] Robson L F, Caetano T J, Agma J M, et al. Clustering very large multi-dimensional datasets with MapReduce // Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM, 2011: 690
[26] Wan J, Yu W M, Xu X H. Design and implement of distributed document clustering based on MapReduce // Proceedings of the Second Symposium International Computer Science and Computational Technology. Huangshan, 2009: 278
[27] Jones K S, Willet P. Readings in Information Retrieval. San Francisco: Morgan Kaufmann Publishers Inc, 1997
[28] Yang Y M. An evaluation of statistical approaches to text categorization. Inf Retr, 1999, 1(1/2): 69
