Fudan University: Genomics course teaching resources (study materials): US Proposes New Classification Standards for Gene Sequencing Data

Science 9 October 2009: Vol. 326 no. 5950 pp. 236-237 DOI: 10.1126/science.1180614 GENOMICS Genome Project Standards in a New Era of Sequencing. P. S. G. Chain, Genomic Standards Consortium Human Microbiome Project Jumpstart Consortium.

For over a decade, genome sequences have adhered to only two standards that are relied on for purposes of sequence analysis by interested third parties (1, 2). However, ongoing developments in revolutionary sequencing technologies have resulted in a redefinition of traditional whole-genome sequencing that requires reevaluation of such standards. With commercially available 454 pyrosequencing (followed by Illumina, SOLiD, and now Helicos), there has been an explosion of genomes sequenced under the moniker “draft”; however, these can be very poor quality genomes (due to inherent errors in the sequencing technologies, and the inability of assembly programs to fully address these errors). Further, one can only infer that such draft genomes may be of poor quality by navigating through the databases to find the number and type of reads deposited in sequence trace repositories (and not all genomes have this available), or to identify the number of contigs or genome fragments deposited to the database. The difficulty in assessing the quality of such deposited genomes has created some havoc for genome analysis pipelines and has contributed to many wasted hours. Exponential leaps in raw sequencing capability and greatly reduced prices have further skewed the time- and cost-ratios of draft data generation versus the painstaking process of improving and finishing a genome. The result is an ever-widening gap between drafted and finished genomes that only promises to continue (see the figure, page 236); hence, there is an urgent need to distinguish good from poor data sets.

US Proposes New Classification Standards for Gene Sequencing Data

Time: 2009-10-27 08:55 Source: Science and Technology Daily

US researchers have proposed quality standards for gene sequencing data, which should help researchers develop more effective vaccines, or help public health agencies or security personnel respond more quickly to potential public health emergencies.

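Recently, a genetics team at the US Los Alamos National Laboratory (LANL) and an international consortium jointly proposed a set of standards intended to clarify the quality of publicly available gene sequencing data. The new standards could eventually allow genetics researchers to develop more effective vaccines, or help public health agencies or security personnel respond more quickly to potential public health emergencies. In the latest issue of Science, LANL geneticist Patrick Chain and his colleagues proposed six labels for genome sequencing data, which classify the data by completeness, accuracy, and the resulting reliability. The labels would be available in public databases, where only two labels are used at present. The work matters because researchers must use such data every day to cross-reference unknown genetic material against the genetic data of known organisms; with the new classification standards, retrieving and comparing data should become much more efficient.

The cells of every organism contain DNA, which is made up of four molecular building blocks (or base pairs); base pairs arranged in specific sequences make up genes. These gene sequences can contain genetic instructions that help or harm an organism. Genome researchers have catalogued thousands of sets of genetic data and placed them in public databases for other researchers to use. However, because of the complexity of genetic data, the genetic information in public databases ranges from very rough to very refined. In the past, such data were usually classified as either “draft” or “finished,” which left too much uncertainty about their accuracy.

Chain said that gene sequencing technology has made major advances in the past few years and that publicly available genetic data has grown explosively: the amount of base-pair sequence data generated each day now runs into the billions, orders of magnitude more than was generated a few years ago. Different sequencing technologies have different levels of accuracy. A high degree of uncertainty in a sequence can lead a researcher down a wrong path that may take a year or even several years to follow. A standard is therefore needed that gives researchers an unambiguous assessment of the quality of genetic sequence data.

Working with genome sequencing centers large and small, including the US Department of Energy’s Joint Genome Institute, the Sanger Institute, the Human Microbiome Project Jumpstart Consortium sequencing centers, Michigan State University, and the Ontario Institute for Cancer Research, Chain proposed expanding the existing classification of sequencing data from two categories to six. The six standards range from “standard draft sequence,” representing the minimum requirement for public submission, to “finished sequence,” the highest standard, whose acceptance criterion is at most one error per 100,000 base pairs.

Chris Detter, leader of the LANL Genome Science Group and director of the Joint Genome Institute-LANL Center, said the aim of the work is for all major genome centers and genomics research groups to use the gradations of classified genome sequencing data that fit their needs. To keep genome sequences as complete as possible, smaller research centers can also adopt the classification when producing and submitting their results, helping other scientists understand the work already done. (Feng Weidong)

Standards for a New Genomic Era
LANL among organizations proposing new genome sequence strategies
Los Alamos, New Mexico, October 21, 2009 - A team of geneticists at Los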

Alamos National Laboratory, together with a consortium of international researchers, has recently proposed a set of standards designed to elucidate the quality of publicly available genetic sequencing information. The new standards could eventually allow genetic researchers to develop vaccines more efficiently or help public health or security personnel more quickly respond to potential public-health emergencies. In a recent issue of Science, Los Alamos geneticist Patrick Chain and colleagues presented six labels for genome sequence data that are, or will become, available in public databases rather than the two labels used today. The six labels would roughly characterize the completeness and accuracy—and consequently, the potential reliability—of genetic sequencing data. This is of great importance since researchers use such data on a daily basis for cross-referencing unknown genetic material with the genetic material of known organisms. Every living organism with DNA has chromosomes containing the four molecular building blocks, or base pairs, represented by letters A, T, G, and C. One chromosome can contain millions of base pairs arranged like rungs on a ladder of DNA. The base pairs are arranged in sets of specific sequences that make up genes. These gene sequences can contain genetic instructions that help or harm an organism—for example by encoding enzymes that digest certain foods, or inducing cellular aberrations that give rise to certain diseases. Genome researchers have catalogued genetic data from thousands of organisms and placed them in publicly available libraries. Researchers can use these libraries to crosscheck genetic data, for example when attempting to isolate an unknown public health threat, or to determine where a potentially helpful or harmful gene may be located on an organism’s chromosome. 
For scientific fields such as biofuels research or environmental remediation, genetic data could help researchers determine whether microorganisms can efficiently break down plant matter to aid in ethanol production, or digest environmental contaminants like hydrocarbons. However, because of the complexity of genetic data, genetic information in

public libraries can range from very rough to very refined. In the past, genetic data has been classified either as “draft” or “finished,” leaving a wide range of uncertainty about the potential accuracy of genetic data. “In the past few years we’ve seen major advances in genetic sequencing technology, so we’ve seen an explosion in the amount of publicly available data,” said Chain, who is lead author of the Science paper. “The amount of base-pair sequencing data generated each day is in the billions—orders of magnitude larger than what was generated a few years ago. Different sequencing technologies have different levels of accuracy. High degrees of uncertainty in a sequence can potentially lead a researcher down a wrong path that they could follow for a year or more. We now have a need for standards that will provide researchers with an unambiguous estimation of the quality of genetic sequence data.” Working with researchers from genome sequencing centers big and small—including the U.S. Department of Energy’s Joint Genome Institute, the Sanger Institute, the Human Microbiome Project Jumpstart Consortium sequencing centers, Michigan State University, and the Ontario Institute for Cancer Research, among others—Chain and colleagues have proposed that sequence data be placed into one of six categories that augment the existing two categories. The six standards range from “standard draft sequence,” representing minimum requirements for public submission, to a “finished sequence,” the highest standard, which can be verified to contain only one sequencing error per 100,000 base pairs. “My hope is all the major genome centers and advanced genomics groups use the gradations that fit their needs,” said Chris Detter, LANL Genome Science Group Leader and Joint Genome Institute-LANL Center director. “Some centers may want all six, while some may only want three, but as long as they keep them intact, we are in good shape. 
Then, my hope is that the smaller genomics groups adopt the classes as written to help the rest of the scientific community know what they are generating and submitting.” Other DOE JGI authors on the Science paper include David Bruce, Phil