School of Software Engineering, Tongji University

Data Cleaning
◼ Data in the Real World Is Dirty: Lots of potentially incorrect data, e.g., faulty instruments, human or computer error, transmission error
◆ incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
   e.g., Occupation=“ ” (missing data)
◆ noisy: containing noise, errors, or outliers
   e.g., Salary=“−10” (an error)
◆ inconsistent: containing discrepancies in codes or names, e.g.,
   Age=“42”, Birthday=“03/07/2010”
   Was rating “1, 2, 3”, now rating “A, B, C”
   discrepancy between duplicate records
◆ intentional (e.g., disguised missing data)
   Jan. 1 as everyone’s birthday?
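The categories above can be turned into simple checks. The following is a minimal sketch, not a definitive cleaning routine. The field names (`occupation`, `salary`, `age`, `birthday`), the helper names `check_record` and `jan1_share`, the sample records, and the reference date are all hypothetical and chosen only to mirror the slide's examples. The slide's birthday "03/07/2010" is read here as month/day/year, which is itself an assumption.

```python
from datetime import date


def check_record(rec, today):
    """Return the dirty-data categories (hypothetical tags) that one record triggers."""
    issues = []

    # incomplete: the attribute value is absent or blank, e.g. Occupation=" "
    if not (rec.get("occupation") or "").strip():
        issues.append("incomplete")

    # noisy: a value that cannot be right, e.g. Salary=-10
    salary = rec.get("salary")
    if salary is not None and salary < 0:
        issues.append("noisy")

    # inconsistent: stored age disagrees with the age derived from the birthday
    age, birthday = rec.get("age"), rec.get("birthday")
    if age is not None and birthday is not None:
        had_birthday_yet = (today.month, today.day) >= (birthday.month, birthday.day)
        derived_age = today.year - birthday.year - (0 if had_birthday_yet else 1)
        if derived_age != age:
            issues.append("inconsistent")

    return issues


def jan1_share(records):
    """Fraction of records whose birthday is Jan. 1 (a hint of disguised missing data)."""
    with_birthday = [r for r in records if r.get("birthday") is not None]
    if not with_birthday:
        return 0.0
    on_jan1 = [r for r in with_birthday if (r["birthday"].month, r["birthday"].day) == (1, 1)]
    return len(on_jan1) / len(with_birthday)


# Hypothetical sample data and reference date, for illustration only.
today = date(2024, 6, 1)
records = [
    {"occupation": " ", "salary": 5000, "age": 30, "birthday": date(1994, 1, 1)},
    {"occupation": "engineer", "salary": -10, "age": 42, "birthday": date(2010, 3, 7)},
    {"occupation": "teacher", "salary": 4200, "age": 36, "birthday": date(1988, 1, 1)},
]

for rec in records:
    print(check_record(rec, today))
print(jan1_share(records))
```

Disguised missing data is checked over the whole data set rather than per record, because a single Jan. 1 birthday is plausible while an unusually large share of them suggests a default value standing in for a missing one. What share counts as suspicious depends on the data and is left to the reader.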