A Survey on Multivariate Data Visualization

Winnie Wing-Yi Chan
Department of Computer Science and Engineering
Hong Kong University of Science and Technology
Clear Water Bay, Kowloon, Hong Kong

June 2006
Table of Contents

Abstract
1 Introduction
  1.1 Motivations
  1.2 Challenges
2 Concepts and Terminology
  2.1 Dimensionality
  2.2 Multidimensional and Multivariate
3 Visualization Techniques
  3.1 Classifications
  3.2 Geometric Projection
    3.2.1 Scatterplot Matrix
    3.2.2 Prosection Matrix
    3.2.3 HyperSlice
    3.2.4 Hyperbox
    3.2.5 Parallel Coordinates
    3.2.6 Radial Coordinate Visualization
    3.2.7 Andrews Curve
    3.2.8 Star Coordinates
    3.2.9 Table lens
  3.3 Pixel-Oriented Techniques
    3.3.1 Space Filling Curve
    3.3.2 Recursive Pattern
    3.3.3 Spiral and Axes Techniques
    3.3.4 Circle Segment
    3.3.5 Pixel Bar Chart
  3.4 Hierarchical Display
    3.4.1 Hierarchical Axis
    3.4.2 Dimensional Stacking
    3.4.3 Worlds Within Worlds
    3.4.4 Treemap
  3.5 Iconography
    3.5.1 Chernoff Faces
    3.5.2 Star Glyph
    3.5.3 Stick Figure
    3.5.4 Shape Coding
    3.5.5 Color Icon
    3.5.6 Texture
4 Discussion and Conclusion
Bibliography
Abstract

Multivariate data visualization, a specific type of information visualization, is an active research field with numerous applications in diverse areas ranging from scientific communities and engineering design to industry and financial markets, in which the correlations between many attributes are of vital interest. In this survey, we first review the motivations and challenges of multivariate data visualization. Section 2 introduces a brief terminology. Section 3 describes some established techniques for multivariate data visualization, classified into several categories to provide a basic taxonomy of the field. At the end of this survey, we discuss some future research directions.
1. Introduction

1.1 Motivations

As information grows exponentially, our world is flooded with data that, we believe, contain valuable information that can expand human knowledge. However, extracting meaningful information is difficult when large quantities of data are presented in plain text or traditional tabular form. Effective graphical representations of the data have thus become popular because they harness human visual perception.

Information visualization is the use of computer-based interactive visual representations of abstract, non-physically based data to amplify human cognition. It aims to help users detect and explore the expected effectively, and to discover the unexpected, so as to gain insight into the data. In multivariate data visualization, the dataset to be analyzed visually is of high dimensionality and its attributes are correlated in some way.

Multivariate data are encountered in all fields by researchers, scientists, engineers, manufacturers, financial managers and various kinds of analysts. Multivariate data visualization is hence strongly motivated by the many situations in which they try to obtain an integrated understanding of the data distributions and to investigate the inter-relationships between different data attributes. An effective visual display tool is needed to help users identify, locate, distinguish, categorize, cluster, rank, compare, associate or correlate the underlying data [3].

1.2 Challenges

Multivariate data visualization faces the same challenge as information visualization in general: finding good visual representations of a problem can be hard and non-deterministic. In addition, multivariate data pose problems in encoding their attributes in a single visual display.

Mapping. Finding a suitable mapping of high-dimensional multivariate data into a 2D visual form is never a simple task. It usually depends on the nature of the datasets to be visualized and is closely related to human perception. Also, associating data attributes with graphical entities requires extreme caution to avoid overwhelming the observer's viewing ability. Combining several elements in a representation may induce cognitive overload in users [6], and graphical attributes should therefore be carefully selected so that they are easy to untangle. It is important that different attributes can be viewed holistically for integrated analysis and that, at the same time, each dimension can be judged by users separately and independently.
Dimensionality. Multivariate data are often of huge size and high dimensionality, which most likely results in a dense structure. It is hence difficult to present such data in a single visual display, making it challenging to let users explore the data space intuitively and interactively and discriminate individual dimensions. Dual views and distortion techniques such as fisheye views may help to solve this problem. Furthermore, the ordering of dimensions has a major impact on the expressiveness of a visualization [7]. Different arrangements allow different conclusions to be drawn, but no ordering principle has been established so far.

Design Tradeoffs. Visualization can provide a qualitative overview of large and complex datasets so that users can look for structure, features, patterns, trends and relationships more effectively [4]. Due to the high dimensionality of multivariate data, we inevitably sacrifice the ability to show the details of each attribute [1], as we have fewer graphical attributes for encoding. This is undesirable when quantitative analysis is required. For multivariate data visualization, there is always a tradeoff between the amount of information, simplicity and accuracy.

Assessment of Effectiveness. The ultimate goal of multivariate data visualization is to gain insight into the data and show the possible correlations between different attributes. In most cases, certain correlations have not yet been discovered before looking at the visual display, and they are exactly what we want to learn from the visualization. This is a paradox [5] that prevents us from assessing the effectiveness of an information visualization technique: we do not know what valuable knowledge is present in the data, so we hope to gain insight by visualizing it. Yet if we know nothing about the pattern or relationship to be shown in the data representation, we can never assess the effectiveness of a particular visualization technique.

2. Concepts and Terminology

2.1 Dimensionality

The dimensionality of a problem in information visualization refers to the number of attributes, or more generally variables, present in the data to be visualized [2]. One-dimensional data, also known as univariate data, consist of only one attribute, such as a collection of houses characterized by their cost. They can be visualized effectively with traditional tools such as tables and histograms. Two-dimensional or bivariate data are usually interpreted using the x-y coordinates of a 2D space. A conventional approach is to plot one variable against the other in a scatterplot, see Figure 2.1.
Figure 2.1: A scatterplot illustrating wine consumption against deaths from heart disease [8].

Technically, multivariate data, also termed hypervariate data, are defined as data with a dimensionality of three or above. However, as we live in a three-dimensional space, three-dimensional or trivariate data are often treated separately. Modeling the data in a 3D space is the most straightforward approach, but problems arise when displaying it in a two-dimensional representation [2]. It is hard to compare two points along the same axis, see Figure 2.2(a). A feasible solution, as shown in Figure 2.2(b), is to project the points onto pairs of axes in two-dimensional scatterplots. 3D surfaces such as the one in Figure 2.3(a) encounter the same difficulty [2]: the minimum value can only be found after altering the view as in Figure 2.3(b). Clearly, orientation becomes crucial as dimensionality increases, and proper interaction should be able to tackle this problem.

Figure 2.2: (a) A 3D scatterplot, (b) Projection of the points in (a) onto two of the axes [9].

Figure 2.3: (a) A 3D surface, (b) A view of (a) after changing the orientation [10].
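The projection of trivariate points onto pairs of axes, as described for Figure 2.2(b), can be illustrated with a minimal sketch. The function name `project_pairs` and the sample points are assumptions made for illustration only; they do not come from the survey or from any particular library.

```python
from itertools import combinations


def project_pairs(points, names):
    """Project n-dimensional points onto every pair of axes.

    Returns a dict mapping (axis_a, axis_b) to the list of 2D points
    obtained by keeping only those two coordinates.
    """
    projections = {}
    for i, j in combinations(range(len(names)), 2):
        projections[(names[i], names[j])] = [(p[i], p[j]) for p in points]
    return projections


# Hypothetical trivariate records (x, y, z), for illustration only.
points = [(1.0, 4.0, 7.0), (2.0, 5.0, 8.0), (3.0, 6.0, 9.0)]
views = project_pairs(points, ["x", "y", "z"])

for axes, projected in views.items():
    print(axes, projected)
```

Each entry of `views` holds the data for one two-dimensional scatterplot. Three attributes yield three such views, and in general n attributes yield n(n-1)/2 unordered pairs.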
The conceptual boundary between low and high dimensionality is not always precisely stated [11]. The term high-dimensional data is used loosely; it can be defined arbitrarily, but it usually denotes a dimensionality of more than four. It is important to observe that geometric projections of data in more than four dimensions are ineffective in conveying information to humans, because of the significant differences in how we perceive low and high dimensionality.

2.2 Multidimensional and Multivariate

The terms multidimensional and multivariate are often used loosely. Strictly speaking, multidimensional refers to the dimensionality of the independent dimensions, while multivariate refers to that of the dependent variables [12]. The more appropriate term for multivariate data visualization would be multidimensional multivariate data visualization [13]. Nevertheless, a set of multivariate data is of high dimensionality and can be regarded as multidimensional because the key relationships between the attributes are generally unknown in advance. The multidimensional property is therefore implied in common usage. For convenience, the term attributes denotes both independent dimensions and dependent variables. It is also worth noting that multivariate data visualization is rather generic and does not fall clearly into either information visualization or scientific visualization.

3. Visualization Techniques

3.1 Classifications

Keim and Kriegel [14] [15] divided visual data exploration techniques for multidimensional multivariate data into six classes, namely geometric, icon-based, pixel-oriented, hierarchical, graph-based and hybrid techniques. We adopt this taxonomy and tailor it to multivariate data visualization techniques, which are classified into four broad categories according to the overall approach taken to generate the resulting visualizations [11]: geometric projection, pixel-oriented techniques, hierarchical display and iconography. They are elaborated in the following sections, where some representative techniques in each group are described in detail.

3.2 Geometric Projection

Geometric projection techniques aim at finding informative projections and transformations of multidimensional datasets [14]. They may map the attributes to a typical Cartesian plane, as in a scatterplot, or more innovatively to an arbitrary space, as in parallel coordinates.
Methods that fall into this category are good at detecting outliers and correlations among different dimensions, and at handling huge datasets when appropriate interaction techniques are introduced [15]. Intrinsically, all data attributes are treated equally, but we must be aware that not all dimensions may be perceived equally [2]. As the order in which axes are displayed affects our perception [14], rearrangement is important if the display is not to be biased. Another potential problem is visual clutter and record overlap [14], which overwhelm the user's perception capabilities when the dimensionality is high or the dataset is large. Some typical techniques using geometric projection are discussed next.

3.2.1 Scatterplot Matrix

A scatterplot is used for bivariate discrete data, with the two attributes projected along the x-y axes of Cartesian coordinates. A scatterplot matrix is an extension to multidimensional data in which a collection of scatterplots is organized in a matrix to provide correlation information among the attributes simultaneously, see Figure 3.1. We can easily observe patterns in the relationships between pairs of attributes from the matrix, but important patterns in higher dimensions may be barely recognizable in it [17]. Another limitation is that it becomes chaotic when the number of points, that is, the number of data items, is too large.

Figure 3.1: A scatterplot matrix for 5-dimensional data of 400 automobiles [17].

Fortunately, the technique of brushing [18] can be applied to address this problem. Brushing aids interpretation by highlighting a particular n-dimensional subspace in the visualization [13]; that is, the points of interest are colored or highlighted in each scatterplot in the matrix. In Figure 3.1, automobiles are color-coded by the number of cylinders. Manufacturers can analyze the performance of the cars based on the number of cylinders in order to make improvements, while customers can decide how many cylinders suit their needs.
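To make the structure of a scatterplot matrix and the effect of brushing concrete, the following minimal sketch builds the point lists for each panel and a single set of highlight flags shared by all panels. It produces data only and does no rendering. The function names `scatterplot_matrix` and `brush`, as well as the four car records, are hypothetical and are not taken from the dataset of Figure 3.1.

```python
def scatterplot_matrix(records, attributes):
    """Build the point list for every off-diagonal panel.

    The panel in row `row_attr` and column `col_attr` plots `col_attr`
    on the x-axis against `row_attr` on the y-axis.
    """
    panels = {}
    for row_attr in attributes:
        for col_attr in attributes:
            if row_attr != col_attr:
                panels[(row_attr, col_attr)] = [
                    (r[col_attr], r[row_attr]) for r in records
                ]
    return panels


def brush(records, predicate):
    """Return one highlight flag per record.

    The same flags are applied to every panel, so a brushed record
    stands out in all scatterplots at once.
    """
    return [predicate(r) for r in records]


# Hypothetical automobile records, for illustration only.
cars = [
    {"mpg": 14.0, "weight": 4300, "horsepower": 165, "cylinders": 8},
    {"mpg": 31.0, "weight": 2000, "horsepower": 70, "cylinders": 4},
    {"mpg": 27.0, "weight": 2300, "horsepower": 88, "cylinders": 4},
    {"mpg": 19.0, "weight": 3200, "horsepower": 105, "cylinders": 6},
]
attributes = ["mpg", "weight", "horsepower"]

panels = scatterplot_matrix(cars, attributes)
highlighted = brush(cars, lambda car: car["cylinders"] == 4)

for (row_attr, col_attr), pts in panels.items():
    brushed = [p for p, on in zip(pts, highlighted) if on]
    print(f"{row_attr} vs {col_attr}: {len(pts)} points, brushed {brushed}")
```

Because the brush is stored per record rather than per panel, the selection stays consistent across the whole matrix, which is what lets a viewer follow the same subset of items from one pair of attributes to the next.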
3.2.2 Prosection Matrix

Prosection was first introduced by Furnas and Buja [19]; Tweedie and Spence [20] later extended it to the prosection matrix, which supports a higher dimensionality. A typical prosection is shown in Figure 3.2(a). In the simplest sense, a prosection is an orthogonal projection in which the data items that lie in the selected multidimensional range are colored differently [15]. The yellow rectangles in Figure 3.2(b) indicate the tolerances on parameter values, which is particularly useful for manufacturers selecting appropriate parameter ranges. However, it gives less information about the correlations between more than two attributes.

Figure 3.2: (a) A prosection, (b) A prosection matrix [21].

3.2.3 HyperSlice

Like the scatterplot matrix and the prosection matrix, HyperSlice [22] uses a matrix of graphics, each representing a scalar function of the variables [23], see Figure 3.3. This method targets continuous scalar functions rather than discrete data. Its most significant improvement over the scatterplot matrix is interactive data navigation around a user-defined focal point [23]. An enhanced HyperSlice has also been proposed [24], which incorporates the concept of display resolution supported by space projection together with the concept of data resolution provided by wavelets to form a powerful multiresolution visualization system.

Figure 3.3: (a) Effect of dragging a slice [22], (b) HyperSlice for a 4D function [23].
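The idea of slicing a scalar function around a focal point can be sketched as follows. This is a simplified illustration under stated assumptions, not the HyperSlice implementation of [22]: it samples only the axis-aligned 2D slices for each unordered pair of variables and performs no rendering. The function name `hyperslice`, its parameters, and the test function `f` are all hypothetical.

```python
def hyperslice(f, focal, half_width, steps):
    """Sample f on axis-aligned 2D slices through the focal point.

    For each pair of variables (i, j) with i < j, variables i and j vary
    over [focal - half_width, focal + half_width] while all remaining
    variables stay fixed at their focal values. `steps` must be >= 2.
    Returns {(i, j): grid}, where grid[row][col] is f at the offset
    offsets[col] along variable i and offsets[row] along variable j.
    """
    n = len(focal)
    offsets = [
        -half_width + 2 * half_width * k / (steps - 1) for k in range(steps)
    ]
    slices = {}
    for i in range(n):
        for j in range(i + 1, n):
            grid = []
            for dj in offsets:
                row = []
                for di in offsets:
                    x = list(focal)
                    x[i] += di
                    x[j] += dj
                    row.append(f(x))
                grid.append(row)
            slices[(i, j)] = grid
    return slices


def f(x):
    """Hypothetical 4D scalar function: squared distance from the origin."""
    return sum(v * v for v in x)


slices = hyperslice(f, focal=(0.0, 0.0, 0.0, 0.0), half_width=1.0, steps=3)

# Moving the focal point, as a user would by dragging, changes every slice.
moved = hyperslice(f, focal=(1.0, 0.0, 0.0, 0.0), half_width=1.0, steps=3)

print(slices[(0, 1)])
print(moved[(1, 2)])
```

Recomputing all slices whenever the focal point moves is the simplest way to mimic the interactive navigation described above. A practical system would presumably update the panels incrementally.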