Motivation Very large volumes of data being collected Driven by growth of web,social media,and more recently internet-of- things Web logs were an early source of data Analytics on web logs has great value for advertisements,web site structuring,what posts to show to a user,etc ■ Big Data:differentiated from data handled by earlier generation databases Volume:much larger amounts of data stored Velocity:much higher rates of insertions Variety:many types of data,beyond relational data Database System Concepts-7th Edition 10.2 ©Silberscha乜,Korth and Sudarshan
Database System Concepts - 7 10.2 ©Silberschatz, Korth and Sudarshan th Edition Motivation ▪ Very large volumes of data being collected • Driven by growth of web, social media, and more recently internet-ofthings • Web logs were an early source of data ▪ Analytics on web logs has great value for advertisements, web site structuring, what posts to show to a user, etc ▪ Big Data: differentiated from data handled by earlier generation databases • Volume: much larger amounts of data stored • Velocity: much higher rates of insertions • Variety: many types of data, beyond relational data
Querying Big Data Transaction processing systems that need very high scalability Many applications willing to sacrifice ACID properties and other database features,if they can get very high scalability Query processing systems that Need very high scalability,and Need to support non-relation data Database System Concepts-7th Edition 10.3 ©Silberscha乜,Korth and Sudarshan
Database System Concepts - 7 10.3 ©Silberschatz, Korth and Sudarshan th Edition Querying Big Data ▪ Transaction processing systems that need very high scalability • Many applications willing to sacrifice ACID properties and other database features, if they can get very high scalability ▪ Query processing systems that • Need very high scalability, and • Need to support non-relation data
Big Data Storage Systems Distributed file systems Shardring across multiple databases Key-value storage systems Parallel and distributed databases Database System Concepts-7th Edition 10.4 ©Silberscha乜,Korth and Sudarshan
Database System Concepts - 7 10.4 ©Silberschatz, Korth and Sudarshan th Edition Big Data Storage Systems ▪ Distributed file systems ▪ Shardring across multiple databases ▪ Key-value storage systems ▪ Parallel and distributed databases
Distributed File Systems A distributed file system stores data across a large collection of machines, but provides single file-system view Highly scalable distributed file system for large data-intensive applications. E.g.,10K nodes,100 million files,10 PB Provides redundant storage of massive amounts of data on cheap and unreliable computers Files are replicated to handle hardware failure Detect failures and recovers from them Examples: Google File System(GFS) Hadoop File System (HDFS) Database System Concepts-7th Edition 10.5 ©Silberscha乜,Korth and Sudarshan
Database System Concepts - 7 10.5 ©Silberschatz, Korth and Sudarshan th Edition Distributed File Systems ▪ A distributed file system stores data across a large collection of machines, but provides single file-system view ▪ Highly scalable distributed file system for large data-intensive applications. • E.g., 10K nodes, 100 million files, 10 PB ▪ Provides redundant storage of massive amounts of data on cheap and unreliable computers • Files are replicated to handle hardware failure • Detect failures and recovers from them ▪ Examples: • Google File System (GFS) • Hadoop File System (HDFS)
Hadoop File System Architecture Single Namespace for entire NameNode Metadata(name,replicas,..) cluster Metadata Ops Files are broken up into BackupNode blocks Metadata(name,replicas..) Client Typically 64 MB block size Block Read DataNodes Each block replicated on multiple DataNodes L Blocks Client 。 Finds location of blocks Client from NameNode Block Write Accesses data directly Replication from DataNode Rack 1 Rack 2 Database System Concepts-7th Edition 10.6 @Silberschatz,Korth and Sudarshan
Database System Concepts - 7 10.6 ©Silberschatz, Korth and Sudarshan th Edition Hadoop File System Architecture ▪ Single Namespace for entire cluster ▪ Files are broken up into blocks • Typically 64 MB block size • Each block replicated on multiple DataNodes ▪ Client • Finds location of blocks from NameNode • Accesses data directly from DataNode
Hadoop Distributed File System(HDFS) NameNode Maps a filename to list of Block IDs Maps each Block ID to DataNodes containing a replica of the block DataNode:Maps a Block ID to a physical location on disk Data Coherency Write-once-read-many access model Client can only append to existing files Distributed file systems good for millions of large files But have very high overheads and poor performance with billions of smaller tuples Database System Concepts-7th Edition 10.7 ©Silberscha乜,Korth and Sudarshan
Database System Concepts - 7 10.7 ©Silberschatz, Korth and Sudarshan th Edition Hadoop Distributed File System (HDFS) ▪ NameNode • Maps a filename to list of Block IDs • Maps each Block ID to DataNodes containing a replica of the block ▪ DataNode: Maps a Block ID to a physical location on disk ▪ Data Coherency • Write-once-read-many access model • Client can only append to existing files ▪ Distributed file systems good for millions of large files • But have very high overheads and poor performance with billions of smaller tuples
Sharding Sharding:partition data across multiple databases Partitioning usually done on some partitioning attributes(also known as partitioning keys or shard keys e.g.user ID E.g.,records with key values from 1 to 100,000 on database 1, records with key values from 100,001 to 200,000 on database 2,etc. Application must track which records are on which database and send queries/updates to that database Positives:scales well,easy to implement Drawbacks: Not transparent:application has to deal with routing of queries, queries that span multiple databases When a database is overloaded,moving part of its load out is not easy Chance of failure more with more databases need to keep replicas to ensure availability,which is more work for application Database System Concepts-7th Edition 10.8 @Silberschatz,Korth and Sudarshan
Database System Concepts - 7 10.8 ©Silberschatz, Korth and Sudarshan th Edition Sharding ▪ Sharding: partition data across multiple databases ▪ Partitioning usually done on some partitioning attributes (also known as partitioning keys or shard keys e.g. user ID • E.g., records with key values from 1 to 100,000 on database 1, records with key values from 100,001 to 200,000 on database 2, etc. ▪ Application must track which records are on which database and send queries/updates to that database ▪ Positives: scales well, easy to implement ▪ Drawbacks: • Not transparent: application has to deal with routing of queries, queries that span multiple databases • When a database is overloaded, moving part of its load out is not easy • Chance of failure more with more databases ▪ need to keep replicas to ensure availability, which is more work for application
Key Value Storage Systems Key-value storage systems store large numbers(billions or even more)of small (KB-MB)sized records Records are partitioned across multiple machines and Queries are routed by the system to appropriate machine Records are also replicated across multiple machines,to ensure availability even if a machine fails Key-value stores ensure that updates are applied to all replicas,to ensure that their values are consistent Database System Concepts-7th Edition 10.10 @Silberschatz,Korth and Sudarshan
Database System Concepts - 7 10.10 ©Silberschatz, Korth and Sudarshan th Edition Key Value Storage Systems ▪ Key-value storage systems store large numbers (billions or even more) of small (KB-MB) sized records ▪ Records are partitioned across multiple machines and ▪ Queries are routed by the system to appropriate machine ▪ Records are also replicated across multiple machines, to ensure availability even if a machine fails • Key-value stores ensure that updates are applied to all replicas, to ensure that their values are consistent
Key Value Storage Systems Key-value stores may store uninterpreted bytes,with an associated key E.g.,Amazon S3,Amazon Dynamo Wide-table(can have arbitrarily many attribute names)with associated key Google BigTable,Apache Cassandra,Apache Hbase, Amazon DynamoDB Allows some operations(e.g.,filtering)to execute on storage node ·JSON MongoDB,CouchDB(document model) Document stores store semi-structured data,typically JSON Some key-value stores support multiple versions of data,with timestamps/version numbers Database System Concepts-7th Edition 10.11 ©Silberscha乜,Korth and Sudarshan
Database System Concepts - 7 10.11 ©Silberschatz, Korth and Sudarshan th Edition Key Value Storage Systems ▪ Key-value stores may store • uninterpreted bytes, with an associated key ▪ E.g., Amazon S3, Amazon Dynamo • Wide-table (can have arbitrarily many attribute names) with associated key • Google BigTable, Apache Cassandra, Apache Hbase, Amazon DynamoDB • Allows some operations (e.g., filtering) to execute on storage node • JSON ▪ MongoDB, CouchDB (document model) ▪ Document stores store semi-structured data, typically JSON ▪ Some key-value stores support multiple versions of data, with timestamps/version numbers
Data Representation An example of a JSON object is: { "ID":"22222", "name": "firstname:"Albert", "lastname:"Einstein" }, "deptname":"Physics", "children": {"firstname":"Hans","lastname":"Einstein"}, "firstname":"Eduard","lastname":"Einstein"} } Database System Concepts-7th Edition 10.12 ©Silberscha乜,Korth and Sudarshan
Database System Concepts - 7 10.12 ©Silberschatz, Korth and Sudarshan th Edition Data Representation ▪ An example of a JSON object is: { "ID": "22222", "name": { "firstname: "Albert", "lastname: "Einstein" }, "deptname": "Physics", "children": [ { "firstname": "Hans", "lastname": "Einstein" }, { "firstname": "Eduard", "lastname": "Einstein" } ] }