Origins of the Top 10 Data Mining Algorithms
1. The three-step selection process
2. The 18 candidate algorithms that passed verification
3. Algorithm descriptions
4. The top 10 data mining algorithms: an overview
5. Open discussion
1. Nominations
At the ICDM international conference held in Hong Kong in September 2006, the winners of the ACM KDD Innovation Award and the IEEE ICDM Research Contributions Award were invited to take part in electing the top 10 data mining algorithms; each winner nominated the 10 algorithms he or she considered most important.
◼ All award winners except one submitted nominations.
◼ Each nomination had to include:
- (a) the algorithm's name
- (b) a brief justification for the nomination
- (c) a representative paper on the algorithm
◼ Every nominated algorithm should be widely cited and used by researchers in the field, and each nominator's picks should be representative of important application areas of data mining.
2. Verification
◼ In October 2006, the citations of each nominated algorithm were checked with Google Scholar, and algorithms with fewer than 50 citing papers were dropped from the candidate list.
◼ 18 nominated algorithms passed verification, falling into 10 data mining topic areas.
3. Voting
◼ A wider group of experts was then invited to vote for the top 10 among these candidates, including:
- (a) the Program Committee members of KDD-06, ICDM '06, and SDM '06
- (b) the winners of the ACM KDD Innovation Award and the IEEE ICDM Research Contributions Award
◼ The top 10 algorithms were ranked by vote count (ties broken alphabetically).
The 18 candidate algorithms that passed verification
§ Classification
1. C4.5: Quinlan, J. R. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc.
2. CART: L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth, Belmont, CA, 1984.
3. k-Nearest Neighbours (kNN): Hastie, T. and Tibshirani, R. 1996. Discriminant Adaptive Nearest Neighbor Classification. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI) 18, 6 (Jun. 1996), 607-616.
4. Naive Bayes: Hand, D. J. and Yu, K. 2001. Idiot's Bayes: Not So Stupid After All? Internat. Statist. Rev. 69, 385-398.
§ Statistical Learning
5. SVM: Vapnik, V. N. 1995. The Nature of Statistical Learning Theory. Springer-Verlag New York, Inc.
6. EM: McLachlan, G. and Peel, D. 2000. Finite Mixture Models. J. Wiley, New York.
§ Association Analysis
7. Apriori: Rakesh Agrawal and Ramakrishnan Srikant. Fast Algorithms for Mining Association Rules. In VLDB '94.
8. FP-Tree: Han, J., Pei, J., and Yin, Y. 2000. Mining frequent patterns without candidate generation. In SIGMOD '00.
§ Link Mining
9. PageRank: Brin, S. and Page, L. 1998. The anatomy of a large-scale hypertextual Web search engine. In WWW-7, 1998.
10. HITS: Kleinberg, J. M. 1998. Authoritative sources in a hyperlinked environment. In Proceedings of the Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, 1998.
§ Clustering
11. K-Means: MacQueen, J. B. Some methods for classification and analysis of multivariate observations. In Proc. 5th Berkeley Symp. Mathematical Statistics and Probability, 1967.
12. BIRCH: Zhang, T., Ramakrishnan, R., and Livny, M. 1996. BIRCH: an efficient data clustering method for very large databases. In SIGMOD '96.
§ Bagging and Boosting
13. AdaBoost: Freund, Y. and Schapire, R. E. 1997. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 55, 1 (Aug. 1997), 119-139.
§ Sequential Patterns
14. GSP: Srikant, R. and Agrawal, R. 1996. Mining Sequential Patterns: Generalizations and Performance Improvements. In Proceedings of the 5th International Conference on Extending Database Technology, 1996.
15. PrefixSpan: J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu. PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth. In ICDE '01.
§ Integrated Mining
16. CBA: Liu, B., Hsu, W., and Ma, Y. M. Integrating classification and association rule mining. KDD-98.
§ Rough Sets
17. Finding reduct: Zdzislaw Pawlak. Rough Sets: Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, Norwell, MA, 1992.
§ Graph Mining
18. gSpan: Yan, X. and Han, J. 2002. gSpan: Graph-Based Substructure Pattern Mining. In ICDM '02.
The 10 classic algorithms
1. C4.5 (successor to ID3)
2. The k-means algorithm
3. Support vector machines (SVM)
4. The Apriori algorithm
5. The Expectation-Maximization (EM) algorithm
6. PageRank
7. AdaBoost
8. kNN: k-nearest neighbor classification
9. Naive Bayes
10. CART: classification and regression trees
Decision tree basics
The girl's parents keep arranging blind dates for her; tired of it, the girl proposes a decision tree, and her parents use it to screen the candidate men.
[Figure: the girl's screening tree, testing in turn age (over 30?), looks (handsome or average vs. ugly), income (high / medium / low), and civil-servant status to decide "meet" or "don't meet". Source: leoo2sk.cnblogs.com]
Decision tree basics: an example

No. | Headache | Muscle pain | Temperature   | Flu?
 1  | yes (1)  | yes (1)     | normal (0)    | N (0)
 2  | yes (1)  | yes (1)     | high (1)      | Y (1)
 3  | yes (1)  | yes (1)     | very high (2) | Y (1)
 4  | no (0)   | yes (1)     | normal (0)    | N (0)
 5  | no (0)   | no (0)      | high (1)      | N (0)
 6  | no (0)   | yes (1)     | very high (2) | Y (1)
 7  | yes (1)  | no (0)      | high (1)      | Y (1)
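Which attribute a decision tree should test first can be quantified with ID3's information gain. Below is a minimal Python sketch, assuming the flu table above with its 0/1/2 codes (the English attribute names are ours):

```python
from collections import Counter
from math import log2

# The flu table, using the slide's numeric codes:
# (headache, muscle pain, temperature) -> flu?
data = [
    ((1, 1, 0), 0),
    ((1, 1, 1), 1),
    ((1, 1, 2), 1),
    ((0, 1, 0), 0),
    ((0, 0, 1), 0),
    ((0, 1, 2), 1),
    ((1, 0, 1), 1),
]

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr):
    """ID3 information gain from splitting on attribute index `attr`."""
    n = len(rows)
    groups = {}
    for x, y in rows:
        groups.setdefault(x[attr], []).append(y)
    remainder = sum(len(ys) / n * entropy(ys) for ys in groups.values())
    return entropy([y for _, y in rows]) - remainder

names = ["headache", "muscle pain", "temperature"]
gains = {name: info_gain(data, i) for i, name in enumerate(names)}
best = max(gains, key=gains.get)
```

On this table the gains come out to roughly 0.13 bits (headache), 0.006 bits (muscle pain), and 0.59 bits (temperature), so ID3 would put the temperature test at the root.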
Decisions in everyday life and work (do it or not?)
• Always test the most decisive attribute first.
  e.g. — should I play badminton outdoors?
• Wind is the most decisive factor.
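Applied recursively, this "most decisive attribute first" rule is essentially the ID3 algorithm: pick the attribute with the highest information gain, split on it, and recurse on each branch. A minimal sketch, using a made-up log of outdoor-badminton decisions (the attributes and records are invented for illustration):

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def best_attr(rows, attrs):
    """Return the attribute index with the highest information gain."""
    def gain(a):
        n = len(rows)
        groups = {}
        for x, y in rows:
            groups.setdefault(x[a], []).append(y)
        split_entropy = sum(len(g) / n * entropy(g) for g in groups.values())
        return entropy([y for _, y in rows]) - split_entropy
    return max(attrs, key=gain)

def build(rows, attrs):
    """Recursive ID3: test the most decisive attribute first."""
    labels = [y for _, y in rows]
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    a = best_attr(rows, attrs)
    rest = [b for b in attrs if b != a]
    branches = {}
    for x, y in rows:
        branches.setdefault(x[a], []).append((x, y))
    return (a, {v: build(sub, rest) for v, sub in branches.items()})

# Made-up records: (windy, sunny) -> play badminton?
rows = [
    ((1, 1), 0), ((1, 0), 0), ((1, 1), 0),  # windy days: never played
    ((0, 1), 1), ((0, 0), 0), ((0, 1), 1),  # calm days: depends on weather
]
tree = build(rows, attrs=[0, 1])  # attribute 0 = windy, 1 = sunny
```

Because wind perfectly separates half of these records, its information gain (about 0.46 bits) beats sunshine (about 0.25 bits), so the root of the learned tree tests "windy" first, matching the intuition above.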