农业院校《试验设计与分析》课程参考资料（多变量分析方法）DUDE - A User-Friendly Crop Information System.pdf_大学文库

DUDE: A User-Friendly Crop Information System Weikai Yan* and Nicholas A. Tinker ABSTRACT A crop information system facilitates the storage, retrieval, and utilization of historical crop performance data. Here, we report a userfriendly crop information system entitled Data Unification and Distillation Engine (DUDE), which (i) converts data from spreadsheet formats to a relational database, (ii) allows editing of tables in the database, (iii) simplifies the construction and execution of queries from the database, (iv) formats queried data into a variety of file structures for data analyses, and (v) simplifies maintenance and corrections of data in the database. DUDE runs on any personal computer that operates with Windows 2000 or later versions of Microsoft Windows. DUDE is freely available from the authors. D ATA FROM CROP PERFORMANCE TRIALS are not only essential for achieving short-term goals of selecting and recommending crop cultivars, they are also valuable in addressing long-term questions about the target environments and the crop as an integrated physiological system (Yan and Tinker, 2005; Yan et al., 2007). Two types of informatics tools are essential to achieving such long term goals: tools for assembling historical crop performance trial data and tools for exploring patterns within such data. The biplot analysis system (Yan et al., 2000, 2007; Yan, 2001; Yan and Rajcan, 2002; Yan and Kang, 2003; Yan and Tinker, 2005, 2006) has been described as an effective tool for exploring historical crop performance trial data. This paper describes a user-friendly crop information system (DUDE) that can facilitate the assembly and use of crop performance data. Tinker and Yan (2006) reviewed crop information systems that are currently available and described in detail an example of a relational database on which a crop information system could be based (Fig. 1). This database structure is referred to as a Context Oriented Observation Library (COOL), which is the underlying relational database for DUDE. A COOL database has the following functionalities: (i) it accommodates various types of data (numeric or nonnumeric) for a large1 number of variables (measured traits, genetic markers, treatment factors) for a large number of genotypes, from a large number of studies, years, and locations; (ii) it provides a mechanism for unifying the formats and terminologies of variables, genotypes, and locations, which are typically heterogeneous among different data sources; and (iii) it simplifies the construction and execution of queries for any data subset, thereby facilitating data mining. More detailed description on the COOL database structure can be found in Tinker and Yan (2006). The greatest difficulty in the use of COOL also derives from its greatest advantages, namely, its serialized data structure and its division of different data types into separate, linked tables. First, it is a difficult task to convert conventional spreadsheet data, which usually has identifiers, traits, and other measurements in parallel columns, to a COOL database, while keeping the context information properly preserved. Although this can be done manually, it involves multiple steps and is error-prone even for highly trained workers. Second, although the serialized data structure of COOL facilitates querying subsets of the data, queried data must be reformatted to parallel data before they can be used in some types of data analysis. Again, although this can be done manually, it is tedious and prone to error. The DUDE application was developed to solve both of these issues, thereby providing an integrated crop information system. DUDE simplifies the task of data import through the use of wizards (i) to convert data from spreadsheet format to a COOL database, (ii) to edit various tables of the database, (iii) to unify trait (variable), genotype, and location names within the database, (iv) to query subsets of data from a COOL database, (v) to save queried data into required formats for data analyses, and (vi) to make corrections in the database. The DUDE application was described briefly by Tinker and Yan (2006). This paper will provide a more detailed description on an updated version of DUDE, and will formally introduce its availability to the research community. VIEWING AND EDITING A COOL DATABASE When DUDE is executed, an opening panel offers two options: converting data from spreadsheets to a COOL database (Populate) or extracting data from a COOL database (Query). When the Populate button is clicked, an open file dialog will appear, asking for a Microsoft Access database with a COOL data structure, to which DUDE will be attached. When a valid COOL database is selected, an interface similar to Fig. 2 will be displayed, which provides options for viewing and editing each of the database tables and for populating the database. Each of the eight tables in the COOL database (Fig. 1) can be viewed by clicking the appropriate button on the top-left of the window (Fig. 2). If the Edit button is clicked and a table is selected, that table will be displayed in the upper left area of Fig. 2 and each field of the current record (row) in that table will be displayed in area on the right. This interface allows the table to be modified from either the upper-left area or the area on the right. One important use of the Edit mode is that heterogeneous trait, variety, and location names can be unified to facilitate data query and analysis (detailed in the data unification section below). A COOL database can also be directly edited through Microsoft Access. However, the user should be careful in deleting/adding fields of a table, as this may affect the integrity of the database and its compatibility with DUDE. It is possible to use DUDE to manage a COOL database without opening the database using MS Access. 1 The current implementation of DUDE uses a COOL database in Microsoft Access file format. The theoretical limit on number of records in a MS Access table is over 2 billion. However, the size of a database is limited to 2 GB, so it is not possible to specify the maximum number of records in a given table. Eastern Cereal and Oilseed Research Centre (ECORC), Agriculture and Agri-Food Canada (AAFC), K.W. Neatby Building, 960 Carling Ave., Ottawa, ON, Canada, K1A 0C6. ECORC Contribution No. 06- 735. Received 10 Oct. 2006. *Corresponding author (yanw@agr.gc.ca). Published in Agron. J. 99:1029–1033 (2007). Software doi:10.2134/agronj2006.0280 ª American Society of Agronomy 677 S. Segoe Rd., Madison, WI 53711 USA Abbreviations: COOL, context oriented observation library; DUDE, data unification and distillation engine. Reproduced from Agronomy Journal. Published by American Society of Agronomy. All copyrights reserved. 1029 Published online June 5, 2007

be prompted to select a table from it. The selected table will then be displayed in the lower-left area. The format of the spreadsheet is flexible, except that different traits or identifiers must be in parallel columns. The identifiers and traits can be anything and no identifiers and traits are essential. To avoid treating different traits as the same trait, however, it is highly advisable that a trait name should contain, explicitly or implicitly, three pieces of information: property, scale (unit), and method, as suggested by the authors of the International Crop Information System (www.icis. cgiar.org, verified 8 Mar. 2007), while preferably keeping the names as short as possible. Examples of trait names are Yield_kg/ha (for yield in kilograms per hectare) and Protein%_NIR (protein concentration in percentage measured using a near-infrared analyzer). Such information will be helpful for trait name unification and data querying. Two traits with different names can be merged by giving them the same alias, but two traits with identical names cannot be separated. Data conversion is a process whereby columns of data in the spreadsheet (the lower-left area, Fig. 2) are populated into appropriate tables of the COOL database while preserving the relationships among them. Data conversion using DUDE is initiated using the Start button (Fig. 2). DUDE will first collect all the headers in the spreadsheet and present them in the Variates/Factors side of the “Separate between Context info and Variates” dialog box (Fig. 3). This list is intended to capture all measured items, that is, various traits or genetic markers or experimental factors. All other headers should be moved to the Context Info list if they describe the experimental units, tests, studies, locations, varieties, cooperators, and so forth, or they can be ignored if they are not relevant. When the headers are appropriately separated and the Header Separator closed, items in the variables/factors list box will be populated into the T_Traits table, and items in the context information list box will be listed in combo boxes on the right side of Fig. 2 (Fig. 4) for use when populating the other tables. DUDE will then guide the user through appropriate steps to populate the other tables in the database, namely, T_Studies, T_Varieties, T_Locations, T_Tests, T_ExpUnits, and T_Obs, and in that order. For each table in the database (upper left area) fields are assigned from columns in the spreadsheet (lower left area) that contain appropriate context information. Only one field is essential for each table, and this field is highlighted in yellow (Fig. 4). The essential field is Name for tables T_Studies, T_Varieties, and T_Locations. It is Yr for the table T-Tests and Block for the table T_ExpUnits. Shown in Fig. 4 is when the T_Varieties table is populated, and the field Name is highlighted as essential. If no appropriate header is found in the combo box for a required field, then it will be assumed that all new records share a common value for that field, and the user is allowed to enter a constant value in the corresponding combo box. The purpose of populating all tables before the table T_Obs is to set up the appropriate relationships among tables. When this is done, each of the observed data is then inserted into the T_Obs table, each record being uniquely defined by a record in the experimental units table and by a record in the traits table (Fig. 1). The observations will be populated individually; empty cells or cells indicated as missing values will be ignored. After a COOL database has been populated with data from a spreadsheet, the same database can be populated with additional data from another spreadsheet in the same file or from other files. DUDE checks and prevents the same dataset from being populated again by comparing each data point that is to be populated with the values that are already in the database. UNIFYING TRAIT, GENOTYPE, AND LOCATION NAMES Crop performance trial data can be highly heterogeneous, as different investigators and the same investigators in different times may give the same trait or genotype or location different names or abbreviations. An important function of the COOL database is that it allows easy unification of the trait (variable), genotype, and location names. For example, the table for traits (T_Traits) has two fields by which a record can be identified (Name and QueryName; Fig. 5). Both fields are Fig. 3. Interface of DUDE to separate the spreadsheet headers into variables and context information. Fig. 4. The right portion of Fig. 2 on the Populate mode. Reproduced from Agronomy Journal. Published by American Society of Agronomy. All copyrights reserved. YAN & TINKER: A USER-FRIENDLY CROP INFORMATION SYSTEM 1031

required, but the Name field is used to designate unique values at the time data is imported. The QueryName field is used to hold names for querying, and it allows duplications. While the Name field should never be altered, the QueryNames field can be manually edited to identify two records that should be queried as if they are the same trait. This allows all data about a trait to be extracted at a single query, no matter what it was originally referred to by the investigators. For example, in Fig. 5, grain yield in kg ha21 was originally entered as KG HA, YIELD_KGHA, YIELD, YIELD(KGHA), or YIELDKGHA by different investigators. Once it is confirmed that these different trait names are for the same trait and are in the same units, they can all be given the same value (e.g., FYIELD_) in the QueryName field. A similar name unification mechanism is available for the tables T_Varieties and T_Locations. Data unification can be performed using the DUDE interface (Fig. 2) or directly through Microsoft Access. QUERYING DATA FROM A COOL DATABASE When DUDE is started and the button for Query is selected, an Open File dialog will appear, asking for a COOL database file. When a valid file is opened, all observations in the database will be displayed in a spreadsheet-like window as shown in Fig. 6. The user can then restrict the observations that are shown by selecting (or deselecting) any of the studies, years, locations, blocks (replications), varieties, data type (numerical vs. nonnumerical), variable type (traits, markers, or treatment factors), or variables that are present in the database by clicking the appropriate buttons on the left of Fig. 6. Figure 7 shows how the data can be restricted to certain years. When the Select Years button is clicked, all years that are involved in the data that are currently displayed will be listed and any of the years can be selected or deselected (Fig. 7). When the Year Selection Form is closed, data displayed in the spreadsheet-like window will be limited to the selected years. In a similar manner, data can be limited by studies, genotypes, locations, traits, and so forth. Fig. 5. Mechanism of DUDE for trait name unification in the COOL database. Fig. 6. DUDE interface for querying a COOL database. Fig. 7. DUDE Interface for limiting data to certain years. Reproduced from Agronomy Journal. Published by American Society of Agronomy. All copyrights reserved. 1032 AGRONOMY JOURNAL, VOL. 99, JULY–AUGUST 2007

SAVING QUERIED DATA INTO APPROPRIATE FORMATS FOR ANALYSIS Once the data have been queried, and an appropriate data subset is displayed, the subset must be saved to a file before it can be used in further data analysis. DUDE can save the data in serial format, exactly as it is displayed, with values for all traits in a single column as shown in Fig. 6. This format is, however, not readily accessible for conventional data analyses, although it is readable by the GGEbiplot software (Yan, 2001). Alternatively, DUDE allows queried data to be saved into the following formats (Fig. 8): (i) MQTL-marker format for molecular marker data. This format conforms to requirements for QTL mapping by the MQTL software (Tinker and Mather, 1995) and is useful if DUDE has been used to assemble marker data (each marker is treated as a trait); (ii) MQTLphenotype format for phenotypic traits. This format conforms to requirements for QTL mapping using the MQTL software (Tinker and Mather, 1995); (iii) Traits in parallel columns. This is the format required by many statistical software packages; (iv) Genotype-by-trait table averaged across replications and environments; and (v) Genotype-by-environment table for each trait. When this option is selected, the user has options to save all queried data, balance the subset by deleting environments with missing cells, or balance the subset by deleting genotypes with missing cells. Each data point will be the arithmetic mean across replications. CORRECTING DATA Once a COOL database has been populated, it is difficult to pinpoint and correct an observation. Alternatively, DUDE provides an easy way to correct data in a COOL database. When the Correct Data button (Fig. 6) is clicked, DUDE allows the user to choose a data point specific to a year, location, genotype, replication, and trait (variable) and to give it a correct value. This is an edit-only function and does not involve data integrity. SYSTEM REQUIREMENTS AND AVAILABILITY The DUDE package is about 2.5 MB. It runs on any personal computer that is installed with Windows 2000 or later versions of Microsoft Windows. It should work on all versions of MS Excel and MS Access files but our experience is limited to MS Office 2000 and Office XP. The DUDE package, along with some sample data files, is freely available from the authors. REFERENCES Tinker, N.A., and D.E. Mather. 1995. Methods for QTL analysis with progeny replicated in multiple environments. J. Quant. Trait Loci 1(1). Available at http://wheat.pw.usda.gov/jag/papers95/paper195/ indexp195.html [verified 9 Mar. 2007]. Tinker, N.A., and W. Yan. 2006. Information systems for crop performance data. Can. J. Plant Sci. 86:647–662. Yan, W. 2001. GGEbiplot—A Windows application for graphical analysis of multienvironment trial data and other types of two-way data. Agron. J. 93:1111–1118. Yan, W., L.A. Hunt, Q. Sheng, and Z. Szlavnics. 2000. Cultivar evaluation and mega-environment investigation based on the GGE biplot. Crop Sci. 40:597–605. Yan, W., and M.S. Kang. 2003. GGE biplot analysis: A graphical tool for breeders, geneticists, and agronomists. CRC Press, Boca Raton, FL. Yan, W., S.J. Molnar, J. Fregeau-Reid, A. McElroy, and N.A. Tinker. 2007. Associations among oat traits and their responses to the environment. J. Crop Improve. 20: (in press). Yan, W., and I. Rajcan. 2002. Biplot evaluation of test sites and trait relations of soybean in Ontario. Crop Sci. 42:11–20. Yan, W., and N.A. Tinker. 2005. An integrated system of biplot analysis for displaying, interpreting, and exploring genotype-byenvironment interactions. Crop Sci. 45:1004–1016. Yan, W., and N.A. Tinker. 2006. Biplot analysis of multi-environment trial data: Principles and applications. Can. J. Plant Sci. 86:623–645. Fig. 8. Options for saving queried data into various formats. Reproduced from Agronomy Journal. Published by American Society of Agronomy. All copyrights reserved. YAN & TINKER: A USER-FRIENDLY CROP INFORMATION SYSTEM 1033