Many ways to store data

Sébastien Ponce (sebastien.ponce@cern.ch), CERN
Thematic CERN School of Computing 2018

Overall Course Structure

Many ways to store data
  - Storage devices and their specificities
  - Distributing and parallelizing storage
Preserving data
  - Data consistency
  - Data safety
Key ingredients to achieve efficient I/O
  - Synchronous vs asynchronous I/O
  - I/O optimizations and caching

Outline

1. Storage devices
   - Existing devices
   - Hierarchical storage
2. Distributed storage
   - Data distribution
   - Data federation
3. Parallelizing files' storage
   - Striping
   - Introduction to Map/Reduce
4. Conclusion

1. Storage devices
   - Existing devices
   - Hierarchical storage

A variety of storage devices

Main differences
  - Capacities from 1 GB to 10 TB per unit
  - Prices from 1 to 300 for the same capacity
  - Very different reliability
  - Very different speeds

Typical numbers in 2018

          Capacity per unit   Latency   $/TB   Speed      Reliability
  RAM     16 GB               5 ns      9000   10 GB/s    volatile
  SSD     500 GB              10 μs     300    550 MB/s   poor
  HD      6 TB                3 ms      25     150 MB/s   average
  Tape    10 TB               100 s     20     500 MB/s   good
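
To make the table's trade-offs concrete, here is a minimal sketch that estimates how long a single read takes on each device and what storing the data costs. It assumes a deliberately simple model that is not from the slides (time = latency + size / speed, one sequential read, decimal units with 1 GB = 10^9 bytes). The names `DEVICES`, `read_time_s` and `storage_cost_usd` are hypothetical helpers introduced only for illustration.

```python
# Figures copied from the "Typical numbers in 2018" table.
# Assumption: decimal units (1 GB = 1e9 bytes, 1 TB = 1e12 bytes).
DEVICES = {
    "RAM":  {"latency_s": 5e-9,  "speed_Bps": 10e9,  "usd_per_tb": 9000},
    "SSD":  {"latency_s": 10e-6, "speed_Bps": 550e6, "usd_per_tb": 300},
    "HD":   {"latency_s": 3e-3,  "speed_Bps": 150e6, "usd_per_tb": 25},
    "Tape": {"latency_s": 100.0, "speed_Bps": 500e6, "usd_per_tb": 20},
}


def read_time_s(device, size_bytes):
    """Estimated time for one sequential read: latency + size / speed.

    Assumption: a simplified model that ignores queuing, caching and
    any other effect not listed in the table.
    """
    d = DEVICES[device]
    return d["latency_s"] + size_bytes / d["speed_Bps"]


def storage_cost_usd(device, size_bytes):
    """Cost of holding size_bytes on the device, from the $/TB column."""
    return DEVICES[device]["usd_per_tb"] * size_bytes / 1e12


if __name__ == "__main__":
    for name in DEVICES:
        small = read_time_s(name, 1e6)    # 1 MB read
        large = read_time_s(name, 10e9)   # 10 GB read
        cost = storage_cost_usd(name, 10e9)
        print(f"{name:5s} 1 MB: {small:.6g} s  10 GB: {large:.6g} s  cost: {cost:.4g} $")
```

Under this model, latency dominates small reads (tape needs about 100 s whatever the size), while for large sequential reads the raw speed takes over and tape becomes comparable to a hard disk.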

A variety of storage devices

You cannot have everything

[Figure: RAM, SSD, HD and Tape positioned against three criteria: cheap, reliability, speed]

Reliability in real world (CERN)

For disks
  - probability of losing a disk per year: a few %, up to 10%
      - with 60K disks, it's around 10 per day
      - and all files are lost
  - one unrecoverable bit error in 10^14 bits read/written
      - for 10 GB files, that's one file corrupted per 1000 files written
For tapes
  - probability of losing a tape per year: 10^-4
      - and you recover most of the data on it
      - net result is 10^-7 file loss per year
  - one unrecoverable bit error in 10^19 bits read/written
      - for 10 GB files, that's one file corrupted per 100M files written
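
The orders of magnitude quoted on this slide can be rechecked with a few lines of arithmetic. The sketch below is only a sanity check. It assumes 5% as a representative value for "a few %" and decimal units (10 GB = 10^10 bytes); the function names are hypothetical.

```python
def disk_failures_per_day(n_disks, annual_failure_rate):
    """Expected number of disks lost per day across the whole installation."""
    return n_disks * annual_failure_rate / 365


def files_per_bit_error(file_size_bytes, bits_per_error):
    """Average number of files written before one hits an unrecoverable bit error."""
    bits_per_file = file_size_bytes * 8
    return bits_per_error / bits_per_file


if __name__ == "__main__":
    # Assumption: 5% stands in for "a few %"; 10% is the slide's upper bound.
    print(disk_failures_per_day(60_000, 0.05))
    print(disk_failures_per_day(60_000, 0.10))
    # 10 GB files, assuming 1 GB = 1e9 bytes.
    print(files_per_bit_error(10e9, 1e14))  # disks
    print(files_per_bit_error(10e9, 1e19))  # tapes
```

With these inputs the disk loss rate lands between roughly 8 and 16 per day, and the bit error rates give about 1250 files (disk) and 1.25 x 10^8 files (tape) per corrupted file, consistent with the rounded figures on the slide.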

Practical Mass Storage - Real Big Data
when you count in 100s of PetaBytes...

The constraints
  - disks or tapes are the only possible solutions
  - disks are unreliable at that scale, and need redundancy
      - we'll see that extensively
  - tapes are cheaper for long-term storage, by a factor of 2 to 2.5
  - tape latency means data has to be accessed from disk