Data structures and algorithms Hongfei Yan Feb.24,2016
Data Structures and Algorithms Hongfei Yan Feb. 24, 2016
Data structure and algorithm analysis Data structure, methods of organizing large amounts of data Algorithm analysis, the estimation of the running time of algorithms
Data Structure and Algorithm Analysis •Data structure, methods of organizing large amounts of data •Algorithm analysis, the estimation of the running time of algorithms
Data Structure and algorithm analysis Data structure is a particular way of storing and organizing data in a computer so that it can be used efficiently E. g, B-tree for database, hash table for compiler Algorithm analysis is the determination of the amount of resources (such as time and storage necessary to execute them Usually the efficiency or running time of an algorithm is stated as a function relating the input length to the number of steps time complexity or storage locations(space complexity)
Data Structure and Algorithm Analysis • Data structure is a particular way of storing and organizing data in a computer so that it can be used efficiently. • E.g., B-tree for database, • hash table for compiler • Algorithm analysis is the determination of the amount of resources (such as time and storage) necessary to execute them. • Usually, the efficiency or running time of an algorithm is stated as a function relating the input length to the number of steps (time complexity) or storage locations (space complexity)
人类公开资源Key,Ken(may14,2006) Scan This book!) 估计至少有 320万本书,7.50亿篇文章 2千5百首歌,50万部电影 5亿个图像,3百万个视频、电视节目和短片 1000亿网页 超星扫描了200多个图书馆中的130万本中文书, 大约是1949年以来中文出版书籍的一半
人类公开资源(Kelly, Kevin (may 14, 2006). “Scan This Book!”. ) •估计至少有 • 320万本书,7.50亿篇文章 • 2千5百首歌,50万部电影 • 5亿个图像,3百万个视频、电视节目和短片 • 1000亿网页 •超星扫描了200多个图书馆中的130万本中文书, 大约是1949年以来中文出版书籍的一半
Google boc计划 2004年12月开始,目的扫描书和杂志,使用字符识别软 件确认文本的字、词、句和段落,将数字化图像转化为 数据化文本。 ·2013年4月,扫描3千万本 ·2010年,全世界估计有1.3亿本书 ·2008年11月,数字化700本 2007年,数字化100万本 20079月,发布 My Library
Google Books计划 • 2004年12月开始,目的扫描书和杂志,使用字符识别软 件确认文本的字、词、句和段落,将数字化图像转化为 数据化文本。 • 2013年4月,扫描3千万本 • 2010年,全世界估计有1.3亿本书 • 2008年11月,数字化700万本 • 2007年,数字化100万本。 • 2007年9月,发布”My Library
ReCaptcha与数据再利用( Luis von ahn ·人们需要从计算机光学字符识别程序无法识别的文 本扫描项目中读出两个单词并输入。 其中一个单词其他用户也识别过,从而可以从该用户的输入中判 断注册者是人; 另一个单词则是有待辨识和解疑的新词。为了保证准确度,系统 会将同一个模糊单词发给五个不同的人,直到他们都输入正确后 才确定这个单词是对的。 在这里,数据的主要用途是证明用户是人,但它也 有第二个目的:破译数字化文本中不清楚的单词
ReCaptcha与数据再利用(Luis von Ahn) • 人们需要从计算机光学字符识别程序无法识别的文 本扫描项目中读出两个单词并输入。 • 其中一个单词其他用户也识别过,从而可以从该用户的输入中判 断注册者是人; • 另一个单词则是有待辨识和解疑的新词。为了保证准确度,系统 会将同一个模糊单词发给五个不同的人,直到他们都输入正确后 才确定这个单词是对的。 • 在这里,数据的主要用途是证明用户是人,但它也 有第二个目的:破译数字化文本中不清楚的单词
HumanGepokaicshttp://www.int Wikipedia Particle Physics Large Hadron World wide Web(10GB) 200PB-Caplured (~1PB) 1000 CAGR www.an (15PB) //www.intel,cow ARCHIVE Personal Digital Annual email Estimated On-line Internet Archive RAM in Google Photos 3 Traffic, no spam (300pB+) (1PB+ (8PB)(100p8 100%o CAGR 200 of Londons 2004 Walmart Typical Oil Merck bio Traffic Cams Transaction db Company Research DB (8T8/day)501B)(350B+)(L5B/q UPMC Hospitals MIT Babytalk Terashake One Day of Imaging Data Speech Earthquake Model Instant Messaging (500TB/ yr)(1.4PB) of LA Basin (1PB)(750GB) Total digital data to be created this year 270, 000PB ( IDC)
7
cDwww.data.g 开放数据 ≡ATAG0V DATA TOPICS. IMPACT APPLICATIONS DEVELOPERS CONTACT The home of the U.S. Government's open data Www.dAta.Gov Here you will find data, tools, and resources to conduct research, develop web and mobile applications, design data visualizations, and more 2009年的47个数据集, 2015年3月达到12万, GETSTARTED EARCHOVER 194 824 DATASETS 2016年2月1948万。 Monthly House Price Indexes BROWSE TOPICS 彰后M2 中盒AL Finance Local Manufacturin Public Safety Govemment
开放数据 • www.data.gov • 2009年的47个数据集, 2015年3月达到12万, 2016年2月19.48万
The Digital Universe in 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East(IDC& EMC, December 2012) The Digital Universe: 50-fold Growth from the Beginning of The Geography of the Digital Universe (2012) 2010 to the End of 2020 United States ● Westem Europe 32% China o India ● Rest of the World 10 092010201120122013201420152016201720182019200 a:2837EB
The Digital Universe in 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East (IDC & EMC, December 2012)
To promote the development of robust and reusable software take a consistent object-oriented viewpoint data should be presented as being encapsulated with the methods that access and modify them think of data objects as instances of an abstract data type (ADT), which includes a repertoire of methods for performing operations on data objects of this type there may be several different implementation strategies for a particular ADt, and explore the relative pros and cons of these choices
To promote the development of robust and reusable software • take a consistent object-oriented viewpoint • data should be presented as being encapsulated with the methods that access and modify them. • think of data objects as instances of an abstract data type (ADT), which includes a repertoire of methods for performing operations on data objects of this type. • there may be several different implementation strategies for a particular ADT, and explore the relative pros and cons of these choices