Data storage and preservation
Sébastien Ponce (sebastien.ponce@cern.ch), CERN
Thematic CERN School of Computing 2019
Outline
1. Storage devices
   - Existing devices
2. Parallelizing files' storage
   - Striping
   - Introduction to Map/Reduce
3. Risks of data loss and corruption
4. Data consistency
   - Checksums
   - Practical usage
5. Data safety
   - Redundancy
   - Parity
   - Erasure coding
6. Conclusion
Storage devices
- Existing devices
A variety of storage devices
Main differences
- Capacities from 1 GB to 10 TB per unit
- Prices ranging from 1 to 300 for the same capacity
- Very different reliability
- Very different speeds

Typical numbers in 2019

        Capacity per unit   Latency   $/TB   Speed      Reliability
RAM     16 GB               10 ns     7000   10 GB/s    volatile
SSD     500 GB              10 µs     200    1 GB/s     poor
HD      6 TB                3 ms      25     150 MB/s   average
Tape    20 TB               100 s     20     500 MB/s   good
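A few derived figures make the trade-offs in the table easier to compare. The sketch below is only an illustration: it reuses the 2019 numbers from the table, assumes decimal units (1 TB = 1000 GB) and ignores latency. The names `DEVICES`, `price_ratio` and `full_read_hours` are hypothetical and not part of the lecture.

```python
# Typical 2019 numbers taken from the table (decimal units assumed).
# name: (capacity in GB, price in $/TB, sequential speed in MB/s)
DEVICES = {
    "RAM":  (16,     7000, 10_000),
    "SSD":  (500,    200,  1_000),
    "HD":   (6_000,  25,   150),
    "Tape": (20_000, 20,   500),
}


def price_ratio(devices):
    """Ratio between the highest and the lowest price per TB."""
    prices = [price for _, price, _ in devices.values()]
    return max(prices) / min(prices)


def full_read_hours(capacity_gb, speed_mb_s):
    """Hours needed to stream one whole unit at its nominal speed."""
    seconds = capacity_gb * 1000 / speed_mb_s
    return seconds / 3600


if __name__ == "__main__":
    print(f"price ratio: {price_ratio(DEVICES):.0f}x")
    for name, (capacity, _, speed) in DEVICES.items():
        hours = full_read_hours(capacity, speed)
        print(f"{name}: {hours:.2f} h to read one full unit")
```

With these inputs the price ratio comes out at 350, the same order of magnitude as the "1 to 300" quoted in the list. Reading one full hard disk or one full tape takes about 11 hours at nominal speed.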
A variety of storage devices: you cannot have everything
[Figure: RAM, SSD, HD and Tape positioned against three competing goals: cheap, reliability, speed]
Reliability in real world (CERN)
For disks
- probability of losing a disk per year: a few %, up to 10%
  - with 60K disks, that is around 10 per day, and all files on them are lost
- one unrecoverable bit error in 10^14 bits read/written
  - for 10 GB files, that is one file corrupted per 1000 files written
For tapes
- probability of losing a tape per year: 10^-4
  - and you recover most of the data on it
  - net result is 10^-7 file loss per year
- one unrecoverable bit error in 10^19 bits read/written
  - for 10 GB files, that is one file corrupted per 100M files written
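The orders of magnitude quoted for disks and tapes can be checked with a few lines of arithmetic. This is a minimal sketch, assuming decimal gigabytes and a disk loss rate of 5 % per year, picked from within the "few % to 10 %" range given above. The function names are hypothetical.

```python
BITS_PER_GB = 8 * 10**9  # decimal gigabyte assumed


def files_per_bit_error(file_size_gb, bits_per_error):
    """Average number of files written per unrecoverable bit error."""
    return bits_per_error / (file_size_gb * BITS_PER_GB)


def disk_losses_per_day(n_disks, yearly_loss_probability):
    """Expected number of disks lost per day in a pool of n_disks."""
    return n_disks * yearly_loss_probability / 365


if __name__ == "__main__":
    # Disks: one error per 10^14 bits, 10 GB files -> about 1 file in 1000.
    print(files_per_bit_error(10, 10**14))
    # Tapes: one error per 10^19 bits, 10 GB files -> about 1 file in 100M.
    print(files_per_bit_error(10, 10**19))
    # 60K disks at an assumed 5 % yearly loss rate -> roughly 8 per day.
    print(disk_losses_per_day(60_000, 0.05))
```

The exact results are 1250 files for disks and 1.25 × 10^8 files for tapes, which the slide rounds to 1000 and 100M.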
Parallelizing files' storage
- Striping
- Introduction to Map/Reduce
Why parallelize storage?
To work around limitations:
- individual device speed (think disk)
  - a file is typically stored on a single device
- network cards' speed
  - 1 Gbit networks are still present
  - network congestion on a node reduces the bandwidth per stream
- core network throughput
  - switches and routers are expensive
  - machines may have less throughput than their card(s) allow(s)
- hot data congestion
  - and the black hole it can create: slower transfers let more transfers accumulate, which slows each of them further
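A toy model shows how these limits combine. The sketch below is a rough illustration, not a measurement. It assumes that a stream is capped by the slower of the disk and the network card, that concurrent streams share a server evenly, and that pieces of a file spread over several servers can be read fully in parallel with no client-side bottleneck. The function and parameter names are hypothetical.

```python
GBIT_IN_MB_S = 125  # 1 Gbit/s = 125 MB/s (decimal units)


def stream_rate_mb_s(disk_mb_s, nic_mb_s, concurrent_streams=1):
    """Rate seen by one stream when streams share a single server evenly."""
    return min(disk_mb_s, nic_mb_s) / concurrent_streams


def transfer_seconds(file_size_gb, rate_mb_s, n_servers=1):
    """Time to move a file whose pieces sit on n_servers read in parallel."""
    return file_size_gb * 1000 / (rate_mb_s * n_servers)


if __name__ == "__main__":
    # One 150 MB/s hard disk behind a 1 Gbit card: the card is the limit.
    alone = stream_rate_mb_s(150, GBIT_IN_MB_S)
    print(transfer_seconds(10, alone))        # 10 GB file, single stream

    # Hot data: ten clients read from the same server at once.
    shared = stream_rate_mb_s(150, GBIT_IN_MB_S, concurrent_streams=10)
    print(transfer_seconds(10, shared))       # each transfer takes 10x longer

    # Same file spread over four servers, read in parallel.
    print(transfer_seconds(10, alone, n_servers=4))
```

Under these assumptions a 10 GB file takes 80 s from a single server, 800 s when ten streams compete for that server, and 20 s when its pieces are read from four servers at once.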