Structuring data for efficient I/O
Sébastien Ponce
sebastien.ponce@cern.ch
CERN
Thematic CERN School of Computing 2017

Overall Course Structure

Structuring Data for efficient I/O
- Data formats, data compression
- Data addressing
Many ways to Store Data
- Storage devices and their specificities
- Distributing and parallelizing storage
Preserving data
- Data consistency
- Data safety
Key ingredients to achieve efficient I/O
- Synchronous vs asynchronous I/O
- I/O optimizations and caching

Outline

1 Data format
  - Row vs Column
2 Compressing data
  - Compression algorithms
  - Efficiency and use cases
3 Data addressing
  - Hierarchical namespaces
  - Limitations
  - Flat namespaces
4 Stateful interfaces
  - POSIX
  - Limitations
  - Stateless interfaces
5 Conclusion

1 Data format
  - Row vs Column

Data structure by example - scenario

Scenario
- You are measuring temperatures within a piece of detector
- You have 10K sensors (captors) and take one measurement every minute
- After a month, you have 432M measurements
- That is about 1.7 GB (1.6 GiB) with single-precision floats (32 bits)
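
The figures in the scenario can be checked with a few lines of arithmetic. This is only a sketch: the 30-day month is an assumption, and the constant names are illustrative.

```python
# Scenario figures from the slide; the 30-day month is an assumption.
SENSORS = 10_000
MINUTES_PER_MONTH = 60 * 24 * 30
BYTES_PER_FLOAT = 4  # single precision, 32 bits

measurements = SENSORS * MINUTES_PER_MONTH
size_bytes = measurements * BYTES_PER_FLOAT

print(measurements)        # 432000000
print(size_bytes / 1e9)    # 1.728 (decimal GB)
print(size_bytes / 2**30)  # about 1.61 (GiB)
```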

Data structure by example - row storage

Naive structure
- You arrange your sensors in a sequential order according to the detector geometry
- Each minute, you create a new "row" of data, with 10K floats representing the temperatures given by the sensors, in that order

Time (min)  Sensor 1  Sensor 2  ...  Sensor c
0           a0        b0        ...  z0
1           a1        b1        ...  z1
...         ...       ...       ...  ...
n           an        bn        ...  zn

File content
a0 b0 ... z0 a1 b1 ... z1 ... an bn ... zn
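
A minimal sketch of this row layout, assuming little-endian 32-bit floats written back to back with no header. The helper names `pack_rows` and `row_offset` are hypothetical and only serve the illustration.

```python
import struct

FLOAT_SIZE = 4  # bytes per single-precision float

def pack_rows(rows):
    """Serialize per-minute rows one after another (row storage)."""
    return b"".join(struct.pack(f"<{len(row)}f", *row) for row in rows)

def row_offset(minute, sensor, n_sensors):
    """Byte offset of one measurement in a row-ordered file."""
    return (minute * n_sensors + sensor) * FLOAT_SIZE

# Tiny example: 3 minutes, 4 sensors.
rows = [[10.0, 11.0, 12.0, 13.0],
        [20.0, 21.0, 22.0, 23.0],
        [30.0, 31.0, 32.0, 33.0]]
data = pack_rows(rows)

off = row_offset(minute=1, sensor=2, n_sensors=4)
(value,) = struct.unpack_from("<f", data, off)
print(off, value)  # 24 22.0
```

The offset formula is what makes "find the row for a given time" cheap: the position of a whole row is a single multiplication.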

Data structure by example - access

Find out overheated devices at a given time
- find the offset of that time in the file
- read 10K numbers
- apply a simple filter

Access pattern: seek, read

Cost
- one seek
- one read of 10K floats (40 KB)

This is efficient!
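
The three steps can be sketched as follows, on a tiny temporary file instead of the real 10K-sensor data. The file layout (raw little-endian floats), the threshold, and the function names `read_row` and `overheated` are assumptions made for the example.

```python
import os
import struct
import tempfile

FLOAT_SIZE = 4

def read_row(path, minute, n_sensors):
    """One seek and one read: fetch all sensor values for a given minute."""
    with open(path, "rb") as f:
        f.seek(minute * n_sensors * FLOAT_SIZE)
        raw = f.read(n_sensors * FLOAT_SIZE)
    return struct.unpack(f"<{n_sensors}f", raw)

def overheated(values, threshold):
    """Indices of sensors whose temperature exceeds the threshold."""
    return [i for i, v in enumerate(values) if v > threshold]

# Build a tiny row-ordered file: 3 minutes x 4 sensors.
rows = [[20.0, 21.0, 22.0, 23.0],
        [20.5, 95.0, 22.5, 80.0],
        [21.0, 21.5, 23.0, 23.5]]
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    for row in rows:
        tmp.write(struct.pack("<4f", *row))
    path = tmp.name

hot = overheated(read_row(path, minute=1, n_sensors=4), threshold=60.0)
os.remove(path)
print(hot)  # [1, 3]
```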

Data structure by example - access (2)

Graph the temperature evolution of a given device
- read 43.2K numbers from the file, one every 40K bytes
- graph them

Access pattern: seek, read, seek, read, seek, read, ...

Cost
- 43.2K reads of 4 bytes and 43.2K seeks!
- on top of that, the typical block size in a filesystem is 8K, so you will probably read effectively 20% of the file (one 8K block out of every 40K bytes)
- actually, reading the whole file sequentially will be more efficient

Here the structure of our data is a killer
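
The 20% figure can be reproduced with a small cost model. It assumes an 8192-byte block size, a file that starts on a block boundary, and that every touched block is read in full; real filesystems and caches will differ.

```python
import math

N_SENSORS = 10_000
N_MINUTES = 43_200
FLOAT_SIZE = 4
BLOCK_SIZE = 8192  # assumed filesystem block size ("8k" on the slide)

def blocks_touched(sensor):
    """Distinct blocks hit when reading one sensor's history from a
    row-ordered file (one 4-byte read per minute)."""
    stride = N_SENSORS * FLOAT_SIZE  # 40,000 bytes between two reads
    return len({(minute * stride + sensor * FLOAT_SIZE) // BLOCK_SIZE
                for minute in range(N_MINUTES)})

file_size = N_SENSORS * N_MINUTES * FLOAT_SIZE
total_blocks = math.ceil(file_size / BLOCK_SIZE)
touched = blocks_touched(sensor=0)

print(touched)                           # 43200
print(round(touched / total_blocks, 3))  # 0.205
```

Because the stride between two reads is larger than a block, every 4-byte read pulls in its own block, which is where the wasted bandwidth comes from.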

Column storage

Time (min)  Sensor 1  Sensor 2  ...  Sensor c
0           a0        b0        ...  z0
1           a1        b1        ...  z1
...         ...       ...       ...  ...
n           an        bn        ...  zn

File content
a0 a1 ... an b0 b1 ... bn ... z0 z1 ... zn

Back to an efficient read
Access pattern: seek, read
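
A minimal sketch of the column layout, again assuming raw little-endian 32-bit floats. With this ordering, the whole history of one sensor is a single contiguous range. The names `pack_columns` and `read_history` are hypothetical.

```python
import struct

FLOAT_SIZE = 4

def pack_columns(rows):
    """Serialize the table sensor by sensor (column storage)."""
    columns = zip(*rows)  # transpose: one tuple per sensor
    return b"".join(struct.pack(f"<{len(col)}f", *col) for col in columns)

def read_history(data, sensor, n_minutes):
    """One contiguous range holds the full history of one sensor."""
    start = sensor * n_minutes * FLOAT_SIZE                   # the single "seek"
    return struct.unpack_from(f"<{n_minutes}f", data, start)  # the single "read"

rows = [[10.0, 11.0, 12.0, 13.0],
        [20.0, 21.0, 22.0, 23.0],
        [30.0, 31.0, 32.0, 33.0]]
data = pack_columns(rows)
print(read_history(data, sensor=2, n_minutes=3))  # (12.0, 22.0, 32.0)
```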

Row vs column storage

Definition
- Row storage respects the internal structure of the data and puts the different items one after another in a sequence
- Column storage breaks the internal structure of the data to collate similar pieces

Why use column storage?
- to optimize I/O in general and avoid scattered reads
- to optimize data compression
- to optimize parallelization of processing

Drawbacks of column storage
- a column-organized file cannot be updated easily
- column storage is usually created from row storage in a postprocessing phase
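
The update drawback can be illustrated by appending one new minute of data to both layouts. This is a sketch on in-memory byte strings, with hypothetical helper names. A real column format would more likely buffer rows and write new column chunks than rewrite the file, but the underlying problem is the same.

```python
import struct

FLOAT_SIZE = 4

def append_row_storage(data, new_row):
    """Row storage: a new minute is a plain append at the end of the file."""
    return data + struct.pack(f"<{len(new_row)}f", *new_row)

def append_column_storage(data, new_row, n_minutes):
    """Column storage: each new value belongs at the end of its own column,
    so every later column has to be shifted (here: a full rewrite)."""
    col_bytes = n_minutes * FLOAT_SIZE
    out = bytearray()
    for sensor, value in enumerate(new_row):
        out += data[sensor * col_bytes:(sensor + 1) * col_bytes]
        out += struct.pack("<f", value)
    return bytes(out)

# 2 minutes x 3 sensors, stored both ways.
row_data = struct.pack("<6f", 1.0, 2.0, 3.0, 4.0, 5.0, 6.0)
col_data = struct.pack("<6f", 1.0, 4.0, 2.0, 5.0, 3.0, 6.0)

new_row = [7.0, 8.0, 9.0]
row_data = append_row_storage(row_data, new_row)
col_data = append_column_storage(col_data, new_row, n_minutes=2)

print(struct.unpack("<9f", row_data))  # (1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0)
print(struct.unpack("<9f", col_data))  # (1.0, 4.0, 7.0, 2.0, 5.0, 8.0, 3.0, 6.0, 9.0)
```

The row version writes 12 new bytes at the end. The column version touches almost every byte of the file, which is why column files are typically produced once, after the row data is complete.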