Multivariate Statistical Analysis, Course Teaching Resources (Reading Material): A Survey on Multivariate Data Visualization

Table of Contents

Abstract
1 Introduction
  1.1 Motivations
  1.2 Challenges
2 Concepts and Terminology
  2.1 Dimensionality
  2.2 Multidimensional and Multivariate
3 Visualization Techniques
  3.1 Classifications
  3.2 Geometric Projection
    3.2.1 Scatterplot Matrix
    3.2.2 Prosection Matrix
    3.2.3 HyperSlice
    3.2.4 Hyperbox
    3.2.5 Parallel Coordinates
    3.2.6 Radial Coordinate Visualization
    3.2.7 Andrews Curve
    3.2.8 Star Coordinates
    3.2.9 Table lens
  3.3 Pixel-Oriented Techniques
    3.3.1 Space Filling Curve
    3.3.2 Recursive Pattern
    3.3.3 Spiral and Axes Techniques
    3.3.4 Circle Segment
    3.3.5 Pixel Bar Chart
  3.4 Hierarchical Display
    3.4.1 Hierarchical Axis
    3.4.2 Dimensional Stacking
    3.4.3 Worlds Within Worlds
    3.4.4 Treemap

1. Introduction

1.1 Motivations

As the volume of information grows exponentially, our world is flooded with data that, we believe, contains valuable information capable of expanding human knowledge. However, extracting meaningful information is difficult when large quantities of data are presented as plain text or in traditional tabular form. Effective graphical representations of data are therefore popular because they harness human visual perception.

Information visualization is the use of computer-based, interactive visual representations of abstract, non-physically based data to amplify human cognition. It aims to help users detect and explore the expected, as well as discover the unexpected, in order to gain insight into the data. In multivariate data visualization, the dataset to be analyzed visually has high dimensionality, and its attributes are correlated in some way.

Multivariate data are encountered in all fields by researchers, scientists, engineers, manufacturers, financial managers, and analysts of various kinds. Multivariate data visualization is hence strongly motivated by the many situations in which these users try to obtain an integrated understanding of data distributions and to investigate the interrelationships between data attributes. An effective visual display tool is needed to help users identify, locate, distinguish, categorize, cluster, rank, compare, associate, or correlate the underlying data [3].

1.2 Challenges

Multivariate data visualization faces the same challenge as information visualization in general: finding a good visual representation of a problem can be hard and nondeterministic. In addition, multivariate data poses the problem of encoding many attributes in a single visual display.
Mapping. Finding a suitable mapping of high-dimensional multivariate data into a 2D visual form is never simple. The mapping usually depends on the nature of the dataset to be visualized and is closely related to human perception. Associating data attributes with graphical entities also requires great care to avoid overwhelming the observer. Combining several elements in a representation may induce cognitive overload in users [6], so graphical attributes should be selected carefully such that they are easy to untangle. It is important that different attributes can be viewed holistically for integrated analysis and that, at the same time, each dimension can be judged separately and independently.


Dimensionality. Multivariate data is often very large and high-dimensional, which most likely results in a dense structure. It is hence difficult to present such data in a single visual display, and challenging to let users explore the data space intuitively and interactively while still discriminating individual dimensions. Dual views and distortion techniques such as fisheye views may help to solve this problem. Furthermore, the ordering of dimensions has a major impact on the expressiveness of a visualization [7]. Different arrangements allow different conclusions to be drawn, but no ordering principle has been established so far.

Design Tradeoffs. Visualization can provide a qualitative overview of large and complex datasets so that users can look for structure, features, patterns, trends, and relationships more effectively [4]. Because of the high dimensionality of multivariate data, we inevitably sacrifice the ability to show the details of each attribute [1], as fewer graphical attributes are available for encoding each one. This is undesirable when quantitative analysis is required. In multivariate data visualization there is always a tradeoff between amount of information, simplicity, and accuracy.

Assessment of Effectiveness. The ultimate goal of multivariate data visualization is to gain insight into the data and to show the possible correlations between different attributes. In most cases, certain correlations have not yet been discovered before looking at the visual display, and they are exactly what we want to acquire from the visualization. This is a paradox [5] that hinders assessing the effectiveness of an information visualization technique: we do not know what valuable knowledge is present in the data, so we hope to gain insight by visualizing it. Yet if we know nothing about the pattern or relationship to be shown in the data representation, we can never assess the effectiveness of a particular visualization technique.
2. Concepts and Terminology

2.1 Dimensionality

The dimensionality of a problem in information visualization refers to the number of attributes, or more generally variables, present in the data to be visualized [2]. One-dimensional data, also known as univariate data, consists of only one attribute, such as a collection of houses characterized by cost. Such data can be visualized effectively with traditional tools like tables and histograms. Two-dimensional, or bivariate, data is usually interpreted using the x-y coordinates of a 2D space. A conventional approach is to plot one variable against the other in a scatterplot; see Figure 2.1.
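The univariate case described above can be sketched in a few lines of Python. This is only an illustration: the house costs, the bin width, and the function name `histogram_counts` are assumptions made for the example and do not come from the survey.

```python
from collections import Counter

def histogram_counts(values, bin_width):
    """Count how many values fall into each bin [k*bin_width, (k+1)*bin_width)."""
    counts = Counter(int(v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

# Hypothetical univariate data: house costs in thousands (assumed values).
house_costs = [120, 135, 180, 210, 215, 260, 305, 310, 330]
width = 100  # assumed bin width
bins = histogram_counts(house_costs, bin_width=width)

# Text rendering of the histogram: one row per bin, one '#' per house.
for lower, count in bins.items():
    print(f"{lower:>4} to {lower + width - 1:<4} {'#' * count}")
```

A bivariate dataset would instead carry two such attributes per record and plot each record as one (x, y) point.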


The conceptual boundary between low and high dimensionality is not always precisely stated [11]. The term high-dimensional data is used loosely; it can be defined arbitrarily, but it usually denotes a dimensionality of more than four. It is important to observe that geometric projections of more than four dimensions are ineffective at conveying information to humans, owing to the significant differences in how low and high dimensionality are perceived.

2.2 Multidimensional and Multivariate

The terms multidimensional and multivariate are often used vaguely. Strictly speaking, multidimensional refers to the dimensionality of the independent dimensions, while multivariate refers to that of the dependent variables [12]. The more appropriate term for multivariate data visualization would be multidimensional multivariate data visualization [13]. Nevertheless, a multivariate dataset has high dimensionality and can be regarded as multidimensional because the key relationships between the attributes are generally unknown in advance. The multidimensional property is therefore implied in common usage.

For convenience, the term attributes denotes both independent dimensions and dependent variables. It is also worth noting that multivariate data visualization is rather generic and does not fall clearly into either information visualization or scientific visualization.
3. Visualization Techniques

3.1 Classifications

Keim and Kriegel [14][15] divided visual data exploration techniques for multidimensional multivariate data into six classes: geometric, icon-based, pixel-oriented, hierarchical, graph-based, and hybrid techniques. We adopt this taxonomy and tailor it to multivariate data visualization techniques, which are classified into four broad categories according to the overall approach taken to generate the resulting visualization [11]: geometric projection, pixel-oriented techniques, hierarchical display, and iconography. These categories are elaborated in the following sections, where some representative techniques in each group are described in detail.

3.2 Geometric Projection

Geometric projection techniques aim to find informative projections and transformations of multidimensional datasets [14]. They may map the attributes to a typical Cartesian plane, as in a scatterplot, or more innovatively to an arbitrary space, as in parallel coordinates.


Methods in this category are good at detecting outliers and correlations among different dimensions, and at handling huge datasets when appropriate interaction techniques are introduced [15]. Intrinsically, all data attributes are treated equally, but we must be aware that not all dimensions may be perceived equally [2]. Because the order in which axes are displayed affects perception [14], rearrangement is important if the display is to be unbiased. Another potential problem is visual clutter and record overlap [14], which overwhelm the user's perception when the dimensionality is high or the dataset is large. Some typical techniques using geometric projection are discussed next.

3.2.1 Scatterplot Matrix

A scatterplot is used for bivariate discrete data, with the two attributes projected along the x and y axes of a Cartesian coordinate system. A scatterplot matrix extends this to multidimensional data: a collection of scatterplots is organized in a matrix to provide correlation information among the attributes simultaneously; see Figure 3.1. Patterns in the relationships between pairs of attributes are easy to observe in the matrix, but important patterns in higher dimensions may be barely recognizable in it [17]. Another limitation is that the display becomes chaotic when the number of points, that is, the number of data items, is too large.

[Figure 3.1: A scatterplot matrix for 5-dimensional data of 400 automobiles [17]. Panel attributes: MPG, Acceleration, Displacement, Weight, Horsepower.]

Fortunately, the technique of brushing [18] can be applied to address this problem. Brushing aids interpretation by highlighting a particular n-dimensional subspace in the visualization [13]; that is, the points of interest are colored or highlighted in every scatterplot of the matrix. In Figure 3.1, automobiles are color-coded by number of cylinders. Manufacturers can analyze the performance of cars by number of cylinders in order to make improvements, while customers can decide how many cylinders suit their needs.
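The layout of a scatterplot matrix and the effect of brushing can be sketched without any plotting library. The snippet below is a minimal illustration, not the implementation behind Figure 3.1: the three attributes, the four car records, and the helper names `scatterplot_matrix` and `brush` are all assumptions. For d attributes it builds the d(d-1) off-diagonal panels (20 for the 5-dimensional example) and marks the brushed records identically in every panel.

```python
from itertools import product

# Hypothetical records: (mpg, weight, cylinders) per automobile (assumed values).
attributes = ["mpg", "weight", "cylinders"]
cars = [
    (30.0, 2100, 4),
    (18.0, 3500, 6),
    (14.0, 4300, 8),
    (27.0, 2300, 4),
]

def scatterplot_matrix(records, names):
    """Return {(row_attr, col_attr): [(x, y), ...]} for every off-diagonal panel."""
    panels = {}
    for (i, row), (j, col) in product(enumerate(names), repeat=2):
        if i != j:
            # Column attribute on the x-axis, row attribute on the y-axis.
            panels[(row, col)] = [(r[j], r[i]) for r in records]
    return panels

def brush(records, predicate):
    """Return the indices of the records selected by the brush."""
    return {k for k, r in enumerate(records) if predicate(r)}

panels = scatterplot_matrix(cars, attributes)
selected = brush(cars, lambda r: r[2] == 4)  # brush the 4-cylinder cars

# The same record indices are highlighted in every panel (linked highlighting).
for (row, col), points in panels.items():
    marked = [(p, k in selected) for k, p in enumerate(points)]
    print(row, "vs", col, marked)
```

Keeping the selection as a set of record indices, rather than as per-panel coordinates, is what lets one brush propagate to all panels at once.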
