Data Technologies
CERN School of Computing 2019
Alberto Pace, alberto.pace@cern.ch
CERN Data and Storage Services Group

Agenda (spread over five lectures)
- Introduction to data management
  - Data Workflows in scientific computing
  - Storage Models
- Data management components
  - Name Servers and databases
  - Data Access protocols
- Reliability
- Availability
- Access Control and Security
  - Cryptography
  - Authentication, Authorization, Accounting
- Scalability
  - Cloud storage
  - Block storage
- Analytics
- Data Replication
- Data Caching
- Monitoring, Alarms
- Quota
- Summary

Introduction to data management

The mission of CERN
[Diagram: accelerating particle beams -> detecting particles (experiments) -> large-scale computing (analysis) -> discovery. "We are here": at the large-scale computing stage.]
The need for computing in research
- Scientific research in recent years has seen its computing requirements explode
- Computing has been the strategy for reducing the cost of traditional research
  - At constant cost, exponential growth of performance
- Computing has opened new horizons of research, not only in High Energy Physics
  - The return on computing investment is higher than in other fields: the budget available for computing has increased, so growth is more than exponential

The need for storage in computing
- Scientific computing for large experiments is typically based on a distributed infrastructure
- Storage is one of the main pillars
- Storage requires Data Management ...
[Diagram: "Scientific Computing" triangle with DATA, CPU and NET at the vertices]

"Why" data management?
- Data Management solves the following problems:
  - Data reliability
  - Access control
  - Data distribution
  - Data archives, history, long-term preservation
- In general: it empowers the implementation of a workflow for data processing

Can we make it simple?
- A simple storage model: all data into the same container
  - Uniform, simple, easy to manage, no need to move data
  - Can provide a sufficient level of performance and reliability
  - "Cloud" storage
- For large repositories, it is too simplistic!
Why multiple pools and quality?
- Derived data used for analysis and accessed by thousands of nodes
  - Needs high performance and low cost; minimal reliability (derived data can be recalculated)
- Raw data that needs to be analyzed
  - Needs high performance and high reliability; can be expensive (small sizes)
- Raw data that has been analyzed and archived
  - Must be low cost (huge volumes) and highly reliable (must be preserved); performance not necessary

So, ... what is data management?
- Examples from LHC experiment data models
- Two building blocks to empower data processing:
  - Data pools with different qualities of service
  - Tools for data transfer between pools

Data pools
- Different qualities of service
- Three parameters: (Performance, Reliability, Cost)
- You can have two but not three
[Diagram: triangle with corners labelled Slow, Expensive and Unreliable; Tapes, Disks, Mirrored disks and Flash / Solid State Disks placed along its edges]

But the balance is not as simple
- Many ways to split (performance, reliability, cost)
- Performance has many sub-parameters: latency / throughput, scalability, consistency
- Cost has many sub-parameters: electrical consumption, HW cost, Ops cost (manpower)
- Reliability has many sub-parameters
And reality is complicated
- Key requirements: Simple, Scalable, Consistent, Reliable, Available, Manageable, Flexible, Performing, Cheap, Secure
- Aiming for "à la carte" services (storage pools) with on-demand "quality of service"
- And where is scalability?
[Radar chart comparing Pool1 and Pool2 (scale 0-80) on: read / write throughput, read / write latency, scalability, consistency, and metadata read / write throughput and latency]

Where are we heading?
- Software solutions + cheap hardware
[Diagram: the Slow / Expensive / Unreliable triangle again; Tapes, Disks, Mirrored disks and Flash / Solid State Disks replaced by "software-defined service + cheap hardware"]

Data Management Components

Agenda (recap; next: Data management components - Name Servers and databases)
Name Server
- The name server is "the" database of a managed storage system: it contains the catalogue of all data (typically all files)
- It is a simple lookup-based, single-key database application for which several implementations exist:
  - DNS (domain name server) software
  - LDAP databases
  - Hash tables / Object databases
  - Relational Databases
- Name server reliability is critical: a name server failure brings down the whole storage system
- Name server performance is critical: see next slide ...

Criticality of the name server performance
- Every meta-data operation requires a database transaction
- It is essential to understand where the "name server" approach is placed ...
- The name server lookup time dictates the performance of the whole storage system
- The database becomes the bottleneck of the entire storage process: low performance is a symptom of a major architectural mismatch
- Comment: Cloud storage? An architecture that replaces the name server DB lookup with a "calculated" name resolution (... more to come; see the sketch after this section)

Short digression on ... Uniform Resource Identifiers (URI)
- Example from the web:
  http://csc.cern.ch/data/2012/School/page.htm
  (protocol = http, host / domain = csc.cern.ch, volume = data, folder / directory = 2012/School, file = page.htm)
- Where is the database lookup when accessing a web page?
  - At the host / domain level: every host has its own namespace, managed locally
  - An excellent example of a "federated" namespace
  - Extremely efficient, but with some limitations
- See http://www.ietf.org/rfc/rfc2396.txt

Similar problem in storage systems
- Example from storage:
  storage://cern.ch/data/2012/School/page.htm
  (same decomposition: protocol, host / domain, volume, folder / directory, file)
- In several implementations, the database lookup is placed at the "file" level
  - This impacts all operations, including the most popular ones: open() and stat()
  - Great flexibility, but a huge performance hit, which implies more hardware and constant database tuning
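The "calculated" name resolution mentioned above can be sketched in a few lines: instead of asking a name-server database where a file lives, every client derives the responsible server from a stable hash of the logical name. This is a minimal illustration, not the algorithm of any particular product; the server pool and the locate() helper are invented for the example.

    import hashlib

    SERVERS = ["disk01.cern.ch", "disk02.cern.ch", "disk03.cern.ch"]  # hypothetical pool

    def locate(path: str) -> str:
        # Map a logical file name to a server with no database lookup:
        # the location is "calculated" from a stable hash of the path,
        # so any client resolves any name independently, in O(1).
        digest = hashlib.sha1(path.encode()).digest()
        return SERVERS[int.from_bytes(digest[:8], "big") % len(SERVERS)]

    print(locate("/data/2012/School/page.htm"))  # same server every time, on every client

The weakness of the naive modulo mapping is that adding or removing a server reshuffles almost every name; production systems therefore use consistent hashing or placement functions (CRUSH in Ceph is a well-known example) to bound that movement.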
Between extremes ...
- There are intermediate solutions between:
  - File systems (no database)
  - Storage systems with the database lookup placed at the "file" level
- Examples of high-performance, scalable solutions:
  - AFS / DFS
    - Database placed at the domain / host level (same as the web)
    - Very scalable
    - But within a domain (eg: "cern.ch"), identical to a file system, with physical files directly mapped to logical filenames
  - XROOTD / Scalla / NFS / clustered storage / cloud storage
    - Database (somehow) placed at the volume level (this is a simplified statement)
    - Similar scalability, with more flexibility in terms of data management: physical files are "calculated" from logical filenames, requiring no database lookup below the "volume" level
  - Federated storage

Agenda (recap; next: Data Access protocols)

Access protocols
- File-level (POSIX) access is the starting point (see the example after this section)
  - Open, Stat, Read, Write, Delete, ...
- Several extensions are "implementation specific" and cannot be mapped to POSIX calls:
  - Pre-staging data from slow storage into fast storage
  - Managing pool creation / sizes
  - Reading or changing access permissions
  - Interpretation of extended data attributes and meta data
- Some parts of the POSIX standard are not scalable (ls, chdir, ...)
- Not all storage systems implement POSIX entirely
- Various protocols offer file access:
  - rfio (Castor, DPM, ...), dcap (dCache), xroot (Scalla, Castor, DPM, EOS, ...), NFS, AFS, S3, ...
- Various protocols handle bulk data movements over wide-area networks:
  - GridFTP, ...

Dataflows across sites
- Storage in scientific computing is distributed across multiple data centres
- Data flows from the experiments to all datacenters where there is CPU available to process the data
[Diagram: Tier-0 (scientific experiments) feeding four Tier-1 centres, each feeding two Tier-2 centres]
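As a reminder of what that POSIX starting point looks like in practice, the calls listed on the slide map directly onto the low-level file API. A minimal, self-contained sketch using Python's thin os-level wrappers (the demo path is arbitrary):

    import os

    path = "/tmp/csc-demo.dat"                            # arbitrary demo path

    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)   # Open (create for writing)
    os.write(fd, b"event data")                           # Write
    os.close(fd)

    info = os.stat(path)                                  # Stat: metadata only, no data read
    print("size:", info.st_size, "mode:", oct(info.st_mode))

    fd = os.open(path, os.O_RDONLY)                       # Open (read)
    print(os.read(fd, 1024))                              # Read
    os.close(fd)

    os.unlink(path)                                       # Delete

Note that stat() touches only metadata: in a managed storage system it is exactly the kind of call that turns into a name-server transaction, which is why its scalability matters.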
Efficiency
- A key parameter in distributed scientific computing is efficiency
- High efficiency requires the CPUs to be colocated with the data they analyze, using the network
- Whenever a site has ...
  - idle CPUs (because no data is available to process), or
  - an excess of data (because there is no CPU left for analysis), or
  - idle or saturated networks
  ... the efficiency drops

Data distribution
- Analysis made with high efficiency requires the data to be pre-placed where the CPUs are available
[Diagram: the Tier-0 / Tier-1 / Tier-2 hierarchy, with data pushed down from Tier-0]

Data distribution
- ... or to allow peer-to-peer data transfer
- This allows sites with an excess of CPU to schedule the pre-fetching of data when it is missing locally, or to access it remotely if the analysis application has been designed to cope with high latency
[Diagram: the same hierarchy, with additional direct transfers between Tier-1 and Tier-2 sites]

Data distribution
- Both approaches coexist in High Energy Physics
- Data is pre-placed
  - This is the role of the experiments, which plan the analysis
- Data is globally accessible and federated in a global namespace
  - The middleware always attempts to use the local data, and relies on an access protocol that redirects to the nearest remote copy when the local data is not available (sketched below)
  - All middleware and jobs are designed to minimize the impact of the additional latency that the redirection requires
- Using access protocols that allow global data federation is essential
  - http, xroot
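The redirect-on-miss logic can be sketched as follows. The site names, latencies and replica catalogue are invented for illustration; in real federations (xroot, http) the redirection is performed by the protocol itself, not by the client application.

    REPLICAS = {  # logical name -> sites holding a copy (invented catalogue)
        "/lhc/run2/file001.root": {"cern.ch", "gridka.de"},
        "/lhc/run2/file002.root": {"in2p3.fr"},
    }
    LATENCY_MS = {"cern.ch": 0.2, "gridka.de": 8.0, "in2p3.fr": 12.0}

    def open_url(name: str, local_site: str) -> str:
        # Try the local replica first; on a miss, redirect to the nearest copy.
        sites = REPLICAS.get(name)
        if not sites:
            raise FileNotFoundError(name)
        if local_site in sites:
            return f"root://{local_site}/{name}"           # fast path: local data
        nearest = min(sites, key=LATENCY_MS.__getitem__)   # redirect on miss
        return f"root://{nearest}/{name}"

    print(open_url("/lhc/run2/file001.root", "cern.ch"))   # served locally
    print(open_url("/lhc/run2/file002.root", "cern.ch"))   # redirected to in2p3.fr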
Data Technologies-CERN School af Compuing 2019 Data Technologes-CERN School of Computing 2019 Agenda Storage Reliability Reliability is related to the probability to lose data ◆ga3t鱼 Def:"the probability that a storage device will perform an arbitrarily large number of 1/O operations without data loss during a specified period of time" ◆Reliability of the“service'”depends on the environment(energy, ◆Reliabiity cooling,people,...) Avolabihe Will not discuss this further Reliability of the "service"starts from the reliability of the underlying hardware Example of disk servers with simple disks:reliability of service ochi reliability of disks But data management solutions can increase the reliability of the hardware at the expenses of performance and/or additional hardware software ◆Disk Mirroring Redundant Array of Inexpensive Disks (RAID) Data Technologles-CERN School af Compuang 2019 Data Technologies-CERN School of Computing 2019 Hardware reliability Reminder:types of RAID ◆Do we need tapes? Tapes have a bad reputation in some use cases ◆RAID0 Slow in random access mode ◆Disk striping high latency in mounting process and when seeking data (F.FWD,REW) Inefficient for small files (in some cases) ◆RAID1 Comparable cost per (peta)byte as hard disks ◆Disk mirroring .Tapes have also some advantages ◆RAID5 Fast in sequential access mode >2xfaster than disk,with physical read after wrie verrcation Parity information is distributed across all disks Several orders of magnitude more reliable than disks .Few hundreds GB loss per year on 80 P8 tape repository ◆RAID6 .Few hundreds TB loss per year an 50 PB disk repostory No power required to preserve the data Uses Reed-Solomon error correction,allowing the Less physical volume required per (peta)byte loss of 2 disks in the array without data loss Inefficiency for small fles issue resolved by recent developments Nobody can delete hundreds of PB in minutes Bottom line:if not used for random access,tapes have a clear role in the architecture 中on.wkipeda.o9 GRAID 33
Agenda (recap; next: Reliability, Availability)

Storage Reliability
- Reliability is related to the probability of losing data
- Def: "the probability that a storage device will perform an arbitrarily large number of I/O operations without data loss during a specified period of time"
- Reliability of the "service" depends on the environment (energy, cooling, people, ...)
  - We will not discuss this further
- Reliability of the "service" starts from the reliability of the underlying hardware
  - Example of disk servers with simple disks: reliability of the service = reliability of the disks
- But data management solutions can increase the reliability of the hardware, at the expense of performance and/or additional hardware / software (see the estimate after this section):
  - Disk Mirroring
  - Redundant Array of Inexpensive Disks (RAID)

Hardware reliability: do we need tapes?
- Tapes have a bad reputation in some use cases:
  - Slow in random access mode (high latency in the mounting process and when seeking data (F-FWD, REW))
  - Inefficient for small files (in some cases)
  - Comparable cost per (peta)byte to hard disks
- Tapes also have some advantages:
  - Fast in sequential access mode (> 2x faster than disk, with physical read-after-write verification)
  - Several orders of magnitude more reliable than disks: a few hundred GB lost per year on an 80 PB tape repository, versus a few hundred TB lost per year on a 50 PB disk repository
  - No power required to preserve the data
  - Less physical volume required per (peta)byte
  - The inefficiency for small files has been resolved by recent developments
  - Nobody can delete hundreds of PB in minutes
- Bottom line: if not used for random access, tapes have a clear role in the architecture

Reminder: types of RAID
- RAID0: disk striping
- RAID1: disk mirroring
- RAID5: parity information is distributed across all disks
- RAID6: uses Reed-Solomon error correction, allowing the loss of 2 disks in the array without data loss
- http://en.wikipedia.org/wiki/RAID
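To see why mirroring buys reliability at the cost of doubled hardware, a back-of-the-envelope estimate: a mirrored pair loses data only if the second disk fails while the first is being replaced. The failure rate and rebuild time below are invented round numbers, and the model ignores correlated failures and unrecoverable read errors during the rebuild.

    # Back-of-the-envelope: single disk vs mirrored pair (invented round numbers).
    afr = 0.02            # annual failure rate of one disk (2%)
    rebuild_days = 1.0    # time to restore redundancy after a failure

    p_single = afr        # data loss probability per year: the one disk fails
    p_mirror = afr * (afr * rebuild_days / 365.0)   # second disk dies during rebuild

    print(f"single disk  : {p_single:.2%} per year")
    print(f"mirrored pair: {p_mirror:.6%} per year")   # ~4 orders of magnitude better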
[Slides 34-37 repeat the "Reminder: types of RAID" list above, each adding a diagram: RAID 0 striping blocks across Disk 0 / Disk 1; RAID 1 mirroring the same blocks on both disks; RAID 4 / RAID 5 with parity blocks (distributed across all disks in RAID 5); RAID 6 with two parity blocks per stripe. A parity sketch follows below.]
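RAID 5's parity is plain XOR: the parity block is the XOR of the data blocks in a stripe, so any single lost block is rebuilt by XOR-ing the survivors (RAID 6 adds a second, Reed-Solomon-coded block to survive two losses). A minimal sketch with invented block contents:

    def xor_blocks(*blocks: bytes) -> bytes:
        # XOR equal-sized blocks together, byte by byte.
        out = bytearray(len(blocks[0]))
        for block in blocks:
            for i, byte in enumerate(block):
                out[i] ^= byte
        return bytes(out)

    d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"   # data blocks on three disks
    parity = xor_blocks(d0, d1, d2)          # parity block on a fourth disk

    # The disk holding d1 dies: rebuild it from the surviving blocks.
    rebuilt = xor_blocks(d0, d2, parity)
    assert rebuilt == d1
    print("recovered:", rebuilt)             # b'BBBB'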
Understanding error correction
- A line is defined by 2 numbers: a, b
- (a, b) is the information: y = ax + b
- Instead of transmitting a and b, transmit some points of the line at known abscissae. 2 points define a line; if I transmit more points, they should all be aligned.
[Diagram: the same line drawn through 2, 3 and 4 transmitted points]

If we lose some information ...
- If we transmit more than 2 points, we can lose any point, provided the total number of points left is >= 2
  - 1 point left instead of 2: information lost
  - 2 points left instead of 3: information recovered
  - 2 or 3 points left instead of 4: information recovered

If we have an error ...
- If there is an error, I can detect it if I have transmitted more than 2 points, and correct it if I have transmitted more than 3 points
  - 2 points, one wrong: the information is lost (and you do not notice)
  - 3 points, one wrong: error detection; the information is lost (and you notice)
  - 4 points, one wrong: error correction; the information is recovered

If you have checksumming on data ...
- You can detect errors by verifying the consistency of the data with the respective checksums, so errors are detected independently ...
- ... and all the redundancy can be used for error correction
  - 3 points, one wrong: the corrupted point is identified by its checksum (information lost, and you notice), and error correction recovers the information
  - 4 points: 2 error corrections are possible; the information is recovered
- A numeric sketch follows below.
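This is the intuition behind Reed-Solomon codes (used by RAID 6, among others). The toy sketch below encodes (a, b) as four points of the line, survives an erasure, and corrects one silent error by brute force, taking the line on which most point pairs agree. Real decoders are far more efficient; the numbers here are arbitrary.

    from itertools import combinations

    a, b = 3, 7                                       # the information: y = 3x + 7
    points = [(x, a * x + b) for x in (0, 1, 2, 3)]   # transmit 4 points, not 2 numbers

    def fit(p, q):
        # Recover (a, b) from two points of the line.
        (x1, y1), (x2, y2) = p, q
        slope = (y2 - y1) / (x2 - x1)
        return slope, y1 - slope * x1

    # Erasure: any 2 surviving points are enough to recover the information.
    print(fit(points[0], points[3]))              # (3.0, 7.0)

    # Silent error: corrupt one point, then take the line most pairs agree on.
    received = list(points)
    received[1] = (1, 99)                         # undetected corruption
    lines = [fit(p, q) for p, q in combinations(received, 2)]
    print(max(lines, key=lines.count))            # (3.0, 7.0) - the majority line

With checksums the corrupted point would be identified directly, turning the error into an erasure and leaving all the redundancy available for recovery, as the last slide above notes.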