正在加载图片...
第5期 何明,等:基于SQL-on-Hadoop查询引擎的日志挖掘及其应用 ·727. 2)明细查询:如图18所示,对网上交易日志的 机研究与发展.2013,50(1):146-149. 明细查询,实现千亿级数据秒级查询响应。主要用 MENG Xiaofeng,CI Xiang.Big data management: Concepts,techniques and challenges [J].Journal of 于实时明细查询,根据时间、系统、站点等多维条件 computer research and development,2013,50(1): 查询从TB级别的日志数据中快速准确地找到所需 146-149. 数据,查询时效均能达到秒级响应。极大地方便了 [5]JOSHI S B.Apache hadoop performance-tuning methodologies 运维管理人员的工作,在节约大量的时间的同时提 and best practices[C]//Proceedings of the 3rd ACM/SPEC 高了问题排查效率。 Interational Conference on Performance Engineering.New Yok,USA,2012:241-242. 在询条件账号时同间01?/0/0m■⊙-2011/09/10■⊙姑店 哈能名蒋动复国号飘候类型柜台类型所企察号请来时树转时护童业能名格新点国可 [6]LAMB W.The storyteller,the scribe,and a missing man: 货金余税酸货散野 ■10:23:3486 hidden influences from printed sources in the gaelic tales of duncan and neil macdonald J].Oral tradition,2012,27 详细信息 明细内容 (1):109-160. 一功能号: 柜台英型:融验券金账号: 客户名称:账户余额:232L.13情求时间:10:23:34耗时:86e: [7]Apache.org.Apache Chukwa EB/OL].[2017-06-07]. P:4时C: 站点:站点务P:国 http://chukwa.apache.org/ [8]GOODHOPE K.KOSHY J,KREPS J,et al.Building 图18明细查询 LinkedIn's real-time activity data pipeline [J].Data Fig.18 Query details engineering,2012,35(2):33-45. [9]APACHE ORG.Apache Flume EB/OL].[2017-06-07]. 6 结束语 https://flume.apache.org. [10]GHEMAWAAT S,GOBIOFF H,LEUNG S T.The Google 本文研究了SQL-on-Hadoop技术在网络日志分 file system[C]//Proc of the 19th ACM Symp on Operating 析中的应用。我们选取了其中最有代表性的3种 Systems Principles.New York,USA,2003:29-43. SQL查询引擎一Hive、Impala和Spark SQL,并使 [11]THUSOO A,SARMA J S,JAIN N,et al.Hive-a 用TPC-H的测试基准对它们的决策支持能力进行 petabyte scale data warehouse using Hadoop [C]/Proc 测试及评估。构建面向证券行业的网络日志分析 of 2010 IEEE 26th International Conference.Piscataway, 平台,实现万亿级日志存储和高效、灵活的查询系 NJ.2010:996-1005. 统,为海量日志集中分析与管理系统应用提供支 [12]APACHE ORG.Apache HBase EB/OL].[2017-06- 持。目前SQL-on-Hadoop系统还存在若干问题有待 07].https://Hbase.apache.org. [13]APACHE ORG.Hadoop Streaming [EB/OL].[2017-06- 解决,在有限的资源使用情况下和特定数据分布场 07].http://hadoop.apache.org/docs/r1.2.1/streaming. 景下提高查询处理效率等问题都有待进一步的 html. 研究。 [14]WEI J,ZHAO Y,JIANG K,et al.Analysis farm:A 参考文献: cloud-based scalable aggregation and query platform for network log analysis C]//International Conference on [1]OLINER A,GANAPATHI A,XU W.Advances and challenges Cloud and Service Computing.Hong Kong,China,2011: in log analysis [J].Communications of the ACM,2012,55 354-359. (2):55-61 [15]RABKIN A,KATZ R H.Chukwa:a system for reliable [2]李国杰,程学旗.大数据研究:未来科技及经济社会发展 large-scale log collection[C]//International Conference 的重大战略领域一大数据的研究现状与科学思考 on Large Installation System Administration.New York, [J].中国科学院院刊,2012,27(6):647-657. USA,2010:163-177. LI Guojie,CHENG Xueqi.Research status and scientific [16]LOGOTHETIS D,TREZZO C,WEBB K,et al.In-situ thinking of big data[J].Bulletin of Chinese academy of mapreduce for log processing [C]//Usenix Conference on sciences,2012,27(6):647-657. Hot Topics in Cloud Computing.Berkeley,USA,2012: [3]王元卓,靳小龙,程学旗.网络大数据:现状与展望[J]. 26-26. 计算机学报,2013,36(6):1125-1138. [17]TREZZO C J.Continuous mapreduce:an architecture for WANG Yuanzhuo,JIN Xiaolong,CHENG Xueqi.Network large-scale in-situ data processing[J].Dissertations and big data:present and future [J].Chinese journal of theses-gradworks,2010,126(7):14. computer,2013,36(6):1125-1138. [18]Apache.org.HDFS Architecture Guide [EB/OL].[2017- [4]孟小峰,慈祥.大数据管理:概念技术与挑战[J刀.计算 06-07].http://hadoop.apache.org/docs/r2.7.2/hadoop-2)明细查询:如图 18 所示,对网上交易日志的 明细查询,实现千亿级数据秒级查询响应。 主要用 于实时明细查询,根据时间、系统、站点等多维条件 查询从 TB 级别的日志数据中快速准确地找到所需 数据,查询时效均能达到秒级响应。 极大地方便了 运维管理人员的工作,在节约大量的时间的同时提 高了问题排查效率。 图 18 明细查询 Fig.18 Query details 6 结束语 本文研究了 SQL⁃on⁃Hadoop 技术在网络日志分 析中的应用。 我们选取了其中最有代表性的 3 种 SQL 查询引擎———Hive、Impala 和 Spark SQL,并使 用 TPC⁃H 的测试基准对它们的决策支持能力进行 测试及评估。 构建面向证券行业的网络日志分析 平台,实现万亿级日志存储和高效、灵活的查询系 统,为海量日志集中分析与管理系统应用提供支 持。 目前 SQL⁃on⁃Hadoop 系统还存在若干问题有待 解决,在有限的资源使用情况下和特定数据分布场 景下提高查询处理效率等问题都有待进一步的 研究。 参考文献: [1]OLINER A, GANAPATHI A, XU W. Advances and challenges in log analysis [J]. Communications of the ACM, 2012, 55 (2): 55-61. [2]李国杰,程学旗. 大数据研究:未来科技及经济社会发展 的重大战略领域———大数据的研究现状与科学思考 [J]. 中国科学院院刊,2012, 27(6): 647-657. LI Guojie, CHENG Xueqi. Research status and scientific thinking of big data [ J]. Bulletin of Chinese academy of sciences, 2012, 27(6): 647-657. [3]王元卓,靳小龙,程学旗. 网络大数据:现状与展望[ J]. 计算机学报, 2013, 36(6): 1125-1138. WANG Yuanzhuo, JIN Xiaolong, CHENG Xueqi. Network big data: present and future [ J ]. Chinese journal of computer, 2013, 36(6): 1125-1138. [4]孟小峰,慈祥. 大数据管理:概念、技术与挑战[ J]. 计算 机研究与发展, 2013, 50(1): 146-149. MENG Xiaofeng, CI Xiang. Big data management: Concepts, techniques and challenges [ J ]. Journal of computer research and development, 2013, 50 ( 1 ): 146-149. [5]JOSHI S B. Apache hadoop performance⁃tuning methodologies and best practices [C] / / Proceedings of the 3rd ACM/ SPEC International Conference on Performance Engineering. New York, USA, 2012: 241-242. [6]LAMB W. The storyteller, the scribe, and a missing man: hidden influences from printed sources in the gaelic tales of duncan and neil macdonald [ J]. Oral tradition, 2012, 27 (1): 109-160. [7]Apache.org. Apache Chukwa [EB/ OL]. [ 2017- 06- 07]. http: / / chukwa.apache.org / [8] GOODHOPE K, KOSHY J, KREPS J, et al. Building LinkedIn’ s real⁃time activity data pipeline [ J ]. Data engineering, 2012, 35(2): 33-45. [ 9]APACHE ORG. Apache Flume [EB/ OL]. [2017-06-07]. https: / / flume.apache.org. [10]GHEMAWAAT S, GOBIOFF H, LEUNG S T. The Google file system[C] / / Proc of the 19th ACM Symp on Operating Systems Principles. New York, USA, 2003: 29-43. [11 ] THUSOO A, SARMA J S, JAIN N, et al. Hive—a petabyte scale data warehouse using Hadoop [C] / / Proc of 2010 IEEE 26th International Conference. Piscataway, NJ, 2010: 996-1005. [12]APACHE ORG. Apache HBase [ EB/ OL]. [ 2017 - 06 - 07]. https: / / Hbase.apache.org. [13]APACHE ORG. Hadoop Streaming [EB/ OL]. [2017-06- 07]. http: / / hadoop. apache. org / docs/ r1. 2. 1 / streaming. html. [14] WEI J, ZHAO Y, JIANG K, et al. Analysis farm: A cloud⁃based scalable aggregation and query platform for network log analysis [ C ] / / International Conference on Cloud and Service Computing. Hong Kong, China, 2011: 354-359. [15] RABKIN A, KATZ R H. Chukwa: a system for reliable large-scale log collection[C] / / International Conference on Large Installation System Administration. New York , USA, 2010: 163-177. [16] LOGOTHETIS D, TREZZO C, WEBB K, et al. In⁃situ mapreduce for log processing [C] / / Usenix Conference on Hot Topics in Cloud Computing. Berkeley, USA, 2012: 26-26. [17]TREZZO C J. Continuous mapreduce: an architecture for large⁃scale in⁃situ data processing [ J]. Dissertations and theses⁃gradworks, 2010, 126(7): 14. [18]Apache.org. HDFS Architecture Guide [EB/ OL]. [2017- 06- 07]. http: / / hadoop. apache. org / docs/ r2.7.2 / hadoop⁃ 第 5 期 何明,等:基于 SQL⁃on⁃Hadoop 查询引擎的日志挖掘及其应用 ·727·
<<向上翻页向下翻页>>
©2008-现在 cucdc.com 高等教育资讯网 版权所有