Chapter 2: Getting to Know Your Data
◼ Data Objects and Attribute Types
◼ Basic Statistical Descriptions of Data
◼ Data Visualization
◼ Measuring Data Similarity and Dissimilarity
◼ Summary
Types of Data Sets
◼ Record
  ◼ Relational records
  ◼ Data matrix, e.g., numerical matrix, crosstabs
  ◼ Document data: text documents: term-frequency vector
  ◼ Transaction data
◼ Graph and network
  ◼ World Wide Web
  ◼ Social or information networks
  ◼ Molecular structures
◼ Ordered
  ◼ Video data: sequence of images
  ◼ Temporal data: time-series
  ◼ Sequential data: transaction sequences
  ◼ Genetic sequence data
◼ Spatial, image and multimedia
  ◼ Spatial data: maps
  ◼ Image data
  ◼ Video data

[Figure: document-term matrix. Rows: Document 1, Document 2, Document 3. Columns (term frequencies): season, timeout, lost, win, game, score, ball, play, coach, team.]

Transaction data:
  TID   Items
  1     Bread, Coke, Milk
  2     Beer, Bread
  3     Beer, Coke, Diaper, Milk
  4     Beer, Bread, Diaper, Milk
  5     Coke, Diaper, Milk
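As a rough illustration of two of the record types above, the following Python sketch builds a term-frequency vector for one document and stores the transaction table as item sets. The vocabulary, the sample sentence, and the helper name `term_frequency_vector` are assumptions made for this example; only the five transactions come from the slide.

```python
from collections import Counter

# Hypothetical vocabulary; in practice it would be derived from the whole corpus.
VOCABULARY = ["team", "coach", "play", "ball", "score", "game"]


def term_frequency_vector(text, vocabulary=VOCABULARY):
    """Count how often each vocabulary term occurs in a text."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


# Made-up sentence, used only to exercise the function.
doc = "The coach told the team to play the ball and the team did score"
tf = term_frequency_vector(doc)

# Transaction data: each transaction ID maps to the set of items bought.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

# Number of transactions that contain a given item.
milk_count = sum("Milk" in items for items in transactions.values())
```

Most entries of a term-frequency vector are zero once the vocabulary grows, which is why document data is usually treated as sparse.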
Important Characteristics of Structured Data
◼ Dimensionality
  ◼ Curse of dimensionality
◼ Sparsity
  ◼ Only presence counts
◼ Resolution
  ◼ Patterns depend on the scale
◼ Distribution
  ◼ Centrality and dispersion
Data Objects
◼ Data sets are made up of data objects.
◼ A data object represents an entity.
◼ Examples:
  ◼ sales database: customers, store items, sales
  ◼ medical database: patients, treatments
  ◼ university database: students, professors, courses
◼ Also called samples, examples, instances, data points, objects, tuples.
◼ Data objects are described by attributes.
◼ Database rows -> data objects; columns -> attributes
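The row/column correspondence can be sketched in Python with a list of dictionaries. The student records and field names below are invented for illustration and are not taken from any real database.

```python
# Each row (tuple) is one data object; each key (column) is one attribute.
# The records are made up for illustration.
students = [
    {"student_id": 1, "name": "Ann", "major": "Biology"},
    {"student_id": 2, "name": "Bob", "major": "Physics"},
]

data_object = students[0]                 # one row -> one data object
attributes = list(students[0].keys())     # column names -> attributes
majors = [s["major"] for s in students]   # one column -> one attribute's values
```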
Attributes
◼ Attribute (or dimension, feature, variable): a data field representing a characteristic or feature of a data object
  ◼ E.g., customer_ID, name, address
◼ Types:
  ◼ Nominal
  ◼ Binary
  ◼ Ordinal
  ◼ Numeric: quantitative
    ◼ Interval-scaled
    ◼ Ratio-scaled
Attribute Types
◼ Nominal: categories, states, or "names of things"
  ◼ Hair_color = {auburn, black, blond, brown, grey, red, white}
  ◼ marital status, occupation, ID numbers, zip codes
◼ Binary
  ◼ Nominal attribute with only 2 states (0 and 1)
  ◼ Symmetric binary: both outcomes equally important
    ◼ e.g., gender
  ◼ Asymmetric binary: outcomes not equally important
    ◼ e.g., medical test (positive vs. negative)
    ◼ Convention: assign 1 to the most important outcome (e.g., HIV positive)
◼ Ordinal
  ◼ Values have a meaningful order (ranking), but the magnitude between successive values is not known.
  ◼ Size = {small, medium, large}, grades, army rankings
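One way to make the differences between these types concrete is to look at which operations are meaningful for each. The sketch below is only an illustration: the patient records are invented, and using `IntEnum` for an ordinal attribute is one possible encoding, not a prescribed one.

```python
from enum import IntEnum

# Nominal: values are labels; only equality tests are meaningful.
hair_a, hair_b = "black", "brown"
same_hair = (hair_a == hair_b)

# Asymmetric binary: by convention, 1 marks the more important outcome.
test_results = {"patient_1": 0, "patient_2": 1, "patient_3": 0}  # made-up data
positives = [p for p, result in test_results.items() if result == 1]


# Ordinal: values can be ranked, but the codes 1, 2, 3 only encode order;
# the gaps between them carry no magnitude.
class Size(IntEnum):
    SMALL = 1
    MEDIUM = 2
    LARGE = 3


sizes = [Size.LARGE, Size.SMALL, Size.MEDIUM]
ranked = sorted(sizes)   # ordering is meaningful
largest = max(sizes)     # so are min and max
```

Averaging the `Size` codes would run without error, but the result would have no clear interpretation because the distance between small and medium is unknown.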
Numeric Attribute Types
◼ Quantity (integer or real-valued)
◼ Interval
  ◼ Measured on a scale of equal-sized units
  ◼ Values have order
    ◼ E.g., temperature in °C or °F, calendar dates
  ◼ No true zero-point
◼ Ratio
  ◼ Inherent zero-point
  ◼ We can speak of a value as being a multiple (or ratio) of another value (10 K is twice as high as 5 K).
    ◼ E.g., temperature in Kelvin, length, counts, monetary quantities
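A short numeric check shows why ratios are only meaningful on a ratio scale. The helper `celsius_to_kelvin` and the two sample temperatures are assumptions chosen for this sketch.

```python
def celsius_to_kelvin(celsius):
    """Convert a Celsius temperature to kelvin (K = degrees C + 273.15)."""
    return celsius + 273.15


c_low, c_high = 5.0, 10.0

# Interval scale: differences are meaningful and survive the change of scale.
diff_celsius = c_high - c_low
diff_kelvin = celsius_to_kelvin(c_high) - celsius_to_kelvin(c_low)

# The naive ratio on the Celsius scale is 2.0, but Celsius has no true zero,
# so this does not mean "twice as hot".
naive_ratio = c_high / c_low

# Ratio scale: kelvin has a true zero, so this ratio is physically meaningful.
true_ratio = celsius_to_kelvin(c_high) / celsius_to_kelvin(c_low)
```

Here `true_ratio` comes out at roughly 1.018, so 10 °C is about 1.8% hotter than 5 °C in absolute terms, not twice as hot.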
Discrete vs. Continuous Attributes
◼ Discrete Attribute
  ◼ Has only a finite or countably infinite set of values
    ◼ E.g., zip codes, profession, or the set of words in a collection of documents
  ◼ Sometimes represented as integer variables
  ◼ Note: Binary attributes are a special case of discrete attributes
◼ Continuous Attribute
  ◼ Has real numbers as attribute values
    ◼ E.g., temperature, height, or weight
  ◼ Practically, real values can only be measured and represented using a finite number of digits
  ◼ Continuous attributes are typically represented as floating-point variables
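The remark about finite precision has a practical consequence when continuous values are compared. The following sketch, with made-up values, contrasts exact comparison of a discrete attribute with tolerance-based comparison of floating-point numbers.

```python
import math

# Continuous values are stored as floating-point numbers with finite precision,
# so 0.1 and 0.2 are only approximated in binary.
total = 0.1 + 0.2

exactly_equal = (total == 0.3)            # False because of rounding error
close_enough = math.isclose(total, 0.3)   # tolerance-based comparison

# Discrete attributes, by contrast, are often stored as integers or labels
# and can be compared exactly.
zip_code_a, zip_code_b = "61801", "61801"  # made-up values
same_zip = (zip_code_a == zip_code_b)
```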
Basic Statistical Descriptions of Data
◼ Motivation
  ◼ To better understand the data: central tendency, variation and spread
◼ Data dispersion characteristics
  ◼ Median, max, min, quantiles, outliers, variance, etc.
◼ Numerical dimensions correspond to sorted intervals
  ◼ Data dispersion: analyzed with multiple granularities of precision
  ◼ Boxplot or quantile analysis on sorted intervals
◼ Dispersion analysis on computed measures
  ◼ Folding measures into numerical dimensions
  ◼ Boxplot or quantile analysis on the transformed cube
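The dispersion characteristics listed above can be computed with Python's standard `statistics` module. This is a minimal sketch on a made-up sample. The 1.5 x IQR outlier rule is a common convention assumed here rather than something stated on this slide, and the `inclusive` quantile method is one of several reasonable choices.

```python
import statistics

# Made-up sample; sorting gives the ordered values that quantiles are based on.
values = sorted([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])

# Quartiles split the sorted values into four equal-sized parts.
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
five_number_summary = (values[0], q1, median, q3, values[-1])

# Dispersion measures.
iqr = q3 - q1
variance = statistics.variance(values)   # sample variance

# Assumed convention: flag values more than 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_fence or v > upper_fence]
```

The five-number summary (min, Q1, median, Q3, max) is exactly what a boxplot draws, with the flagged outliers plotted as individual points.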
Measuring the Central Tendency
◼ Mean (algebraic measure) (sample vs. population):
    $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ (sample), $\mu = \frac{\sum x}{N}$ (population)
  Note: n is sample size and N is population size.
  ◼ Weighted arithmetic mean:
    $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
  ◼ Trimmed mean: chopping extreme values
◼ Median:
  ◼ Middle value if odd number of values, or average of the middle two values otherwise
  ◼ Estimated by interpolation (for grouped data):
    $\mathit{median} = L_1 + \left( \frac{n/2 - (\sum \mathit{freq})_l}{\mathit{freq}_{\mathit{median}}} \right) \times \mathit{width}$
◼ Mode
  ◼ Value that occurs most frequently in the data
  ◼ Unimodal, bimodal, trimodal
  ◼ Empirical formula: $\mathit{mean} - \mathit{mode} = 3 \times (\mathit{mean} - \mathit{median})$

Grouped data example:
  interval   frequency
  1-5        200
  6-15       450
  16-20      300
  21-50      1500
  51-80      700
  81-110     44
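The measures on this slide can be sketched in a few lines of Python. The sample `values`, the helper names, and the trimming fraction are assumptions for illustration. For the grouped median, the sketch reads $L_1$ as the lower bound of the median interval, $(\sum \mathit{freq})_l$ as the total frequency of all intervals below it, and the width as `upper - lower + 1`, which assumes integer-valued, inclusive interval bounds; other boundary conventions shift the result slightly.

```python
import statistics

values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]  # made-up sample


def weighted_mean(xs, ws):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(xs, ws)) / sum(ws)


def trimmed_mean(xs, fraction):
    """Mean after dropping the given fraction of values from each end."""
    xs = sorted(xs)
    k = int(len(xs) * fraction)
    kept = xs[k:len(xs) - k]
    return sum(kept) / len(kept)


def grouped_median(intervals):
    """Interpolated median for (lower, upper, freq) intervals sorted by lower."""
    n = sum(freq for _, _, freq in intervals)
    if n == 0:
        raise ValueError("intervals contain no observations")
    below = 0  # total frequency of the intervals below the current one
    for lower, upper, freq in intervals:
        if below + freq >= n / 2:
            width = upper - lower + 1  # assumes inclusive integer bounds
            return lower + (n / 2 - below) / freq * width
        below += freq


mean = sum(values) / len(values)
median = statistics.median(values)
modes = statistics.multimode(values)   # more than one entry -> multimodal
trimmed = trimmed_mean(values, 0.1)    # drops the lowest and highest value

# Frequency table from the slide.
table = [(1, 5, 200), (6, 15, 450), (16, 20, 300),
         (21, 50, 1500), (51, 80, 700), (81, 110, 44)]
approx_median = grouped_median(table)
```

For the slide's table, n = 3194, so n/2 = 1597 falls in the interval 21-50 (950 observations lie below it), and the interpolated median is 21 + (647 / 1500) x 30, which is about 33.94. The sample in `values` has two modes (52 and 70), so it is bimodal, and the single outlying value 110 pulls the mean (58) above the median (54).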