Chongqing University: Data Warehouse and Data Mining, course slides (PPT, English version), Chapter 2: Getting to Know Your Data

Resource category: library; document format: PPT; pages: 42; file size: 1.06 MB
Chapter 2: Getting to Know Your Data
◼ Data Objects and Attribute Types
◼ Basic Statistical Descriptions of Data
◼ Data Visualization
◼ Measuring Data Similarity and Dissimilarity
◼ Summary

Types of Data Sets
◼ Record
  ◼ Relational records
  ◼ Data matrix, e.g., numerical matrix, crosstabs
  ◼ Document data: text documents: term-frequency vector
  ◼ Transaction data
◼ Graph and network
  ◼ World Wide Web
  ◼ Social or information networks
  ◼ Molecular structures
◼ Ordered
  ◼ Video data: sequence of images
  ◼ Temporal data: time-series
  ◼ Sequential data: transaction sequences
  ◼ Genetic sequence data
◼ Spatial, image and multimedia
  ◼ Spatial data: maps
  ◼ Image data
  ◼ Video data

Example of document data (term-frequency vectors):

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
Document 3     0      1     0     0      1     2    2     0        3       0

Example of transaction data:

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
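To make the term-frequency representation concrete, the following minimal sketch counts how often each vocabulary term occurs in each document. The function name `term_frequency_vectors`, the whitespace tokenization, and the toy documents are assumptions made for illustration; they are not taken from the slides.

```python
from collections import Counter


def term_frequency_vectors(documents, vocabulary):
    """Return one term-count vector per document, ordered like `vocabulary`.

    Tokenization is a simple lowercase whitespace split (an assumption;
    real pipelines usually also strip punctuation and stop words).
    """
    vectors = []
    for text in documents:
        counts = Counter(text.lower().split())
        # Counter returns 0 for terms that never occur in the document.
        vectors.append([counts[term] for term in vocabulary])
    return vectors


# Hypothetical toy documents, not the ones behind the slide's matrix.
docs = [
    "team play team score",
    "coach ball coach",
    "game win game timeout",
]
vocab = ["team", "coach", "play", "ball", "score",
         "game", "win", "lost", "timeout", "season"]

tf = term_frequency_vectors(docs, vocab)
for row in tf:
    print(row)
```

Each row has one entry per vocabulary term, so most entries are zero once the vocabulary grows; this is the sparsity that document data typically exhibits.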

Important Characteristics of Structured Data
◼ Dimensionality
  ◼ Curse of dimensionality
◼ Sparsity
  ◼ Only presence counts
◼ Resolution
  ◼ Patterns depend on the scale
◼ Distribution
  ◼ Centrality and dispersion

Data Objects
◼ Data sets are made up of data objects.
◼ A data object represents an entity.
◼ Examples:
  ◼ sales database: customers, store items, sales
  ◼ medical database: patients, treatments
  ◼ university database: students, professors, courses
◼ Also called samples, examples, instances, data points, objects, tuples.
◼ Data objects are described by attributes.
◼ Database rows -> data objects; columns -> attributes

Attributes
◼ Attribute (or dimension, feature, variable): a data field representing a characteristic or feature of a data object.
  ◼ E.g., customer_ID, name, address
◼ Types:
  ◼ Nominal
  ◼ Binary
  ◼ Ordinal
  ◼ Numeric: quantitative
    ◼ Interval-scaled
    ◼ Ratio-scaled

Attribute Types
◼ Nominal: categories, states, or "names of things"
  ◼ Hair_color = {auburn, black, blond, brown, grey, red, white}
  ◼ marital status, occupation, ID numbers, zip codes
◼ Binary
  ◼ Nominal attribute with only 2 states (0 and 1)
  ◼ Symmetric binary: both outcomes equally important
    ◼ e.g., gender
  ◼ Asymmetric binary: outcomes not equally important
    ◼ e.g., medical test (positive vs. negative)
    ◼ Convention: assign 1 to the most important outcome (e.g., HIV positive)
◼ Ordinal
  ◼ Values have a meaningful order (ranking), but the magnitude between successive values is not known.
  ◼ Size = {small, medium, large}, grades, army rankings
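The distinctions above determine which operations are meaningful on a value. The sketch below is only an illustration under assumed names (`nominal_equal`, `ordinal_rank`, `encode_test_result` are hypothetical helpers): nominal values support equality tests only, ordinal values additionally support order comparisons through their rank, and an asymmetric binary attribute codes the more important outcome as 1.

```python
# Ordinal domain from the slide: the list position encodes the ranking.
SIZE_ORDER = ["small", "medium", "large"]


def nominal_equal(a, b):
    """Nominal values can only be compared for equality."""
    return a == b


def ordinal_rank(value, order=SIZE_ORDER):
    """Map an ordinal value to its rank (0-based).

    Ranks can be compared, but differences between ranks carry no
    magnitude information.
    """
    return order.index(value)


def encode_test_result(result):
    """Asymmetric binary: code the more important outcome as 1."""
    return 1 if result == "positive" else 0


print(nominal_equal("auburn", "black"))
print(ordinal_rank("small") < ordinal_rank("large"))
print(encode_test_result("positive"), encode_test_result("negative"))
```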

Numeric Attribute Types
◼ Quantity (integer or real-valued)
◼ Interval
  ◼ Measured on a scale of equal-sized units
  ◼ Values have order
    ◼ E.g., temperature in °C or °F, calendar dates
  ◼ No true zero-point
◼ Ratio
  ◼ Inherent zero-point
  ◼ We can speak of values as being an order of magnitude larger than the unit of measurement (10 K is twice as high as 5 K).
    ◼ e.g., temperature in Kelvin, length, counts, monetary quantities
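A short worked example may help show why ratios are meaningful only on a ratio scale. The sketch below converts two Celsius readings to Kelvin; the helper name `celsius_to_kelvin` and the sample readings are assumptions chosen for illustration.

```python
def celsius_to_kelvin(celsius):
    """Convert a Celsius reading to Kelvin (offset of 273.15)."""
    return celsius + 273.15


a_c, b_c = 10.0, 5.0                      # two readings in degrees Celsius
a_k, b_k = celsius_to_kelvin(a_c), celsius_to_kelvin(b_c)

# Differences are meaningful on an interval scale and survive the conversion.
diff_celsius = a_c - b_c
diff_kelvin = a_k - b_k

# Ratios are not: the Celsius zero point is arbitrary, so 10/5 = 2 does not
# mean "twice as hot". Only the Kelvin ratio reflects a physical ratio.
ratio_celsius = a_c / b_c
ratio_kelvin = a_k / b_k

print(diff_celsius, round(diff_kelvin, 6))
print(ratio_celsius, round(ratio_kelvin, 4))
```

The difference is 5 units on both scales, while the ratio changes from 2 in Celsius to roughly 1.018 in Kelvin, which is why "twice as high" is reserved for ratio-scaled attributes.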

Discrete vs. Continuous Attributes
◼ Discrete Attribute
  ◼ Has only a finite or countably infinite set of values
    ◼ E.g., zip codes, profession, or the set of words in a collection of documents
  ◼ Sometimes represented as integer variables
  ◼ Note: Binary attributes are a special case of discrete attributes
◼ Continuous Attribute
  ◼ Has real numbers as attribute values
    ◼ E.g., temperature, height, or weight
  ◼ Practically, real values can only be measured and represented using a finite number of digits
  ◼ Continuous attributes are typically represented as floating-point variables

Basic Statistical Descriptions of Data
◼ Motivation
  ◼ To better understand the data: central tendency, variation and spread
◼ Data dispersion characteristics
  ◼ median, max, min, quantiles, outliers, variance, etc.
◼ Numerical dimensions correspond to sorted intervals
  ◼ Data dispersion: analyzed with multiple granularities of precision
  ◼ Boxplot or quantile analysis on sorted intervals
◼ Dispersion analysis on computed measures
  ◼ Folding measures into numerical dimensions
  ◼ Boxplot or quantile analysis on the transformed cube
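The dispersion characteristics listed above (min, quartiles, median, max, outliers) are what a boxplot summarizes. The following sketch computes them with Python's standard `statistics` module. The sample values are illustrative toy data, and the 1.5 × IQR outlier fence is the common boxplot convention, assumed here rather than stated on this slide.

```python
import statistics


def five_number_summary(values):
    """Return min, Q1, median, Q3, max, and values outside the 1.5*IQR fences."""
    data = sorted(values)
    # method="inclusive" treats the data as the full population of interest
    # and interpolates linearly between order statistics.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [x for x in data if x < low_fence or x > high_fence]
    return {
        "min": data[0], "q1": q1, "median": median,
        "q3": q3, "max": data[-1], "outliers": outliers,
    }


# Illustrative toy values (assumed for this example).
sample = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
summary = five_number_summary(sample)
print(summary)
```

Other quantile conventions (for example `method="exclusive"`) give slightly different quartiles on small samples, so the chosen method should be stated whenever results are reported.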

Measuring the Central Tendency
◼ Mean (algebraic measure) (sample vs. population):
  $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ (sample), $\mu = \frac{\sum x}{N}$ (population)
  Note: n is the sample size and N is the population size.
  ◼ Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
  ◼ Trimmed mean: chopping extreme values
◼ Median:
  ◼ Middle value if odd number of values, or average of the middle two values otherwise
  ◼ Estimated by interpolation (for grouped data):
    $median = L_1 + \left( \frac{n/2 - (\sum freq)_l}{freq_{median}} \right) \times width$
    where $L_1$ is the lower boundary of the median interval, $(\sum freq)_l$ is the sum of the frequencies of all intervals below the median interval, $freq_{median}$ is the frequency of the median interval, and $width$ is the width of the median interval.

    Interval  Frequency
    1-5       200
    6-15      450
    16-20     300
    21-50     1500
    51-80     700
    81-110    44
◼ Mode
  ◼ Value that occurs most frequently in the data
  ◼ Unimodal, bimodal, trimodal
  ◼ Empirical formula: $mean - mode \approx 3 \times (mean - median)$
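The measures on this slide can be sketched in a few lines of Python. The helper names (`weighted_mean`, `trimmed_mean`, `grouped_median`) and the ungrouped toy values are assumptions for illustration. The grouped-data example uses the frequency table from the slide, with the additional assumptions that each interval's lower boundary is its first listed value and that its width is the count of integer values it covers (for example, 30 for 21-50); other boundary conventions shift the result slightly.

```python
import statistics


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


def trimmed_mean(values, fraction):
    """Mean after dropping `fraction` of the values at each end (fraction < 0.5)."""
    data = sorted(values)
    k = int(len(data) * fraction)          # number of values cut per side
    kept = data[k:len(data) - k]
    return sum(kept) / len(kept)


def grouped_median(intervals):
    """Median by interpolation for grouped data.

    `intervals` is a sorted list of (lower_boundary, width, frequency).
    Implements L1 + ((n/2 - sum of lower frequencies) / freq_median) * width.
    """
    n = sum(freq for _, _, freq in intervals)
    below = 0                              # cumulative frequency below the interval
    for lower, width, freq in intervals:
        if below + freq >= n / 2:          # this is the median interval
            return lower + (n / 2 - below) / freq * width
        below += freq
    raise ValueError("empty frequency table")


# Illustrative ungrouped toy values (assumed for this example).
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
mean = statistics.mean(values)
median = statistics.median(values)
modes = statistics.multimode(values)       # two modes here, so the data is bimodal
trimmed = trimmed_mean(values, 0.1)        # drops the smallest and largest value

# Frequency table from the slide, as (lower boundary, width, frequency).
table = [(1, 5, 200), (6, 10, 450), (16, 5, 300),
         (21, 30, 1500), (51, 30, 700), (81, 30, 44)]
approx_median = grouped_median(table)

print(mean, median, modes, trimmed, round(approx_median, 2))
```

The trimmed mean is less sensitive than the plain mean to the single large value in the toy data, which is the reason for chopping extreme values before averaging.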
