
LSbM-tree:一个读写兼优的大数据存储结构
LSbM-tree:一个读写兼优的大数据存储结构 1

2MajorDataFormatsinStorageSystems: Sequentially archived data- Indexed data, e.g. sorted data by a defined key, ...- Read/write largely by B+-tree and LSM-tree.Relational tables- Structured data formats for relational databases, e.g. MySQL- Read/write operations by relational algebra/calculus·Key-value store-Apair ofkey/value fora data item, e.g.redis, memcached-Read/write: request -> index -> fetching data·Graph-databases? Free-style text files-A file may be retrieved by KV-store, indexed directory,
Major Data Formats in Storage Systems • Sequentially archived data – Indexed data, e.g. sorted data by a defined key, . – Read/write largely by B+-tree and LSM-tree • Relational tables – Structured data formats for relational databases, e.g. MySQL – Read/write operations by relational algebra/calculus • Key-value store – A pair of key/value for a data item, e.g. redis, memcached – Read/write: request -> index –> fetching data • Graph-databases • Free-style text files – A file may be retrieved by KV-store, indexed directory, . 2

3New Challenges to AccessPerformance in Big Data.Sequentially archived data-Canwemassivelyprocessbothreadsandwritesconcurrently?-butLSM-treefavorswritesandB+-treefavorsreads.Relationaltables- Tables must partitioned/placed among many nodes,e.g.Apache Hive- How to minimize data transfers among nodes and from local disks?·Key-valuestore-Keyindexingbecomesabottleneckas#concurrentrequestsincrease-Howtoacceleratedataaccessesforin-memorykey-valuestore?
New Challenges to Access Performance in Big Data • Sequentially archived data – Can we massively process both reads and writes concurrently? – but LSM-tree favors writes and B+-tree favors reads • Relational tables – Tables must partitioned/placed among many nodes, e.g. Apache Hive – How to minimize data transfers among nodes and from local disks? • Key-value store – Key indexing becomes a bottleneck as # concurrent requests increase – How to accelerate data accesses for in-memory key-value store? 3

Fast Accesses to Sequentially/ArchivedDatain both memory and disks
Fast Accesses to Sequentially Archived Data in both memory and disks 4

What is LSM-tree?It is a Log-structured merge-tree (1996):Multiple levels of sorted data, (e.g. each by a B+ tree)Each level increases exponentially, forming a “pyramid'The smallest level is in memory and the rest on diskCoDRAMHDDC1C2
What is LSM-tree? It is a Log-structured merge-tree (1996): • Multiple levels of sorted data, (e.g. each by a B+ tree) • Each level increases exponentially, forming a “pyramid” • The smallest level is in memory and the rest on disk C0 C1 C2 C0 C1 C2 HDD DRAM 5

LsM-treeiswidely usedinproduction systemsAPACHGoogleBigTableHBASEDTANLBSTMUEUROcassandraRocksDBMariaDBlevelDBb
LSM-tree is widely used in production systems 6 RocksDB

Why Log-structured merge-tree(LSM-tree)?B+treeInsert2.5- In-placeupdateseekRandoml/Os-Lowwritethroughput2.5dddad,LSM-treewrites- Log-structuredupdate-Merge/compactionforsortingDRAM-Sequentiall/OsHDDmerges-Highwritethroughput
Why Log-structured merge-tree (LSM-tree)? • B+ tree – In-place update – Random I/Os – Low write throughput • LSM-tree – Log-structure for sequential I/O – Merge for sorting – Batched I/O operations – High write throughput 7 2.5 d2 ’ seek Insert 2.5 • LSM-tree – Log-structured update – Merge/compaction for sorting – Sequential I/Os – High write throughput 7 writes merges HDD DRAM

8SQLite Experiences of LSM-tree by Richard HippLog StructuredMerge(LSM)BadGood-Slowerreads.FasterwritesBackgroundmergeReduced writeprocessamplificationMorespaceondisk-Linearwrites.Greatercomplexity.LessSSDwear
SQLite Experiences of LSM-tree by Richard Hipp 8

AbuffercacheproblemreportedbyCassandrain2014CASSANDRA-6916-readoaC*2.1 beta1 vs beta2110,000effective caching100,00090,00080.00070,00060.00050,00warmingup40.00030,00020,000disenabled caching10,000250t5o20010060Total operation time (seconds)*https://www.datastax.com/dev/blog/compaction-improvements-in-cassandra-219
A buffer cache problem reported by Cassandra in 2014 9 *https://www.datastax.com/dev/blog/compaction-improvements-in-cassandra-21 warming up effective caching disenabled caching

BasicFunctionsofBufferCacheBuffer cache is in DRAM or other fast devicesData entries in buffer cache refer to disk blocks (page)Cache the frequently read disk blocks for reuseBuffer cache.disk2a10
Basic Functions of Buffer Cache • Buffer cache is in DRAM or other fast devices • Data entries in buffer cache refer to disk blocks (page) • Cache the frequently read disk blocks for reuse 10 b c a disk Buffer cache a c b