NATURE | Vol 437 | 27 October 2005 ARTICLES
© 2005 Nature Publishing Group

also Supplementary Table 1). All genotyping centres produced high-quality data (accuracy more than 99% in the blind QA exercises, Supplementary Tables 2 and 3), and missing data were not biased against heterozygotes. The Supplementary Information contains the full details of these efforts.

Although SNP selection was generally agnostic to functional annotation, 11,500 non-synonymous cSNPs (SNPs in coding regions of genes where the different SNP alleles code for different amino acids in the protein) were successfully typed in Phase I. (An effort was made to prioritize cSNPs in Phase I in choosing SNPs for each 5-kb region; all known non-synonymous cSNPs were attempted as part of Phase II.)

Across the ten ENCODE regions (Table 2), the density of SNPs was approximately tenfold higher as compared to the genome-wide map: 17,944 SNPs across the 5 megabases (Mb) (one per 279 bp).

More than 1.3 million SNP genotyping assays were attempted (Table 3) to generate the Phase I data on more than 1 million SNPs. The 0.3 million SNPs not part of the Phase I data set include 73,652 that passed QC filters but were monomorphic in all 269 samples.
The remaining SNPs failed the QC filters in one or more analysis panels mostly because of inadequate completeness, non-mendelian inheritance, deviations from Hardy–Weinberg equilibrium, discrepant genotypes among duplicates, and data transmission discrepancies.

SNPs on the Phase I map are evenly spaced, except on Y and mtDNA. The Phase I data include a successful, common SNP every 5 kb across most of the genome in each analysis panel (Supplementary Fig. 1): only 3.3% of inter-SNP distances are longer than 10 kb, spanning 11.9% of the genome (Fig. 2; see also Supplementary Fig. 2). One exception is the X chromosome (Supplementary Fig. 1), where a much higher proportion of attempted SNPs were rare or monomorphic, and thus the density of common SNPs is lower.

Two intentional exceptions to the regular spacing of SNPs on the physical map were the mitochondrial chromosome (mtDNA), which does not undergo recombination, and the non-recombining portion of chromosome Y. On the basis of the 168 successful, polymorphic SNPs, each HapMap sample fell into one of 15 (of the 18 known) mtDNA haplogroups³⁴ (Table 4). A total of 84 SNPs that characterize the unique branches of the reference Y genealogical tree³⁵–³⁷ were genotyped on the HapMap samples. These SNPs assigned each Y chromosome to 8 (of the 18 major) Y haplogroups previously described (Table 4).

Highly accurate phasing of long-range chromosomal haplotypes. Despite having collected data in diploid individuals, the inclusion of parent–offspring trios and the use of computational methods made it possible to determine long-range phased haplotypes of extremely high quality for each individual. These computational algorithms take advantage of the observation that because of LD, relatively few of the large number of possible haplotypes consistent with the genotype data actually occur in population samples.
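The contribution of parent–offspring trios to phasing can be illustrated with a minimal sketch. It covers only the Mendelian-inheritance step: at each SNP, the child's paternal and maternal alleles are fixed whenever the three genotypes allow a single transmission pattern. It is not the PHASE algorithm, and the statistical step that uses LD to resolve the remaining sites is not shown. The genotype coding (0, 1 or 2 copies of the alternate allele) and the function names `transmissible`, `phase_child_site` and `phase_child` are assumptions made for illustration.

```python
from typing import List, Optional, Tuple


def transmissible(genotype: int) -> set:
    """Alleles (0 = reference, 1 = alternate) a parent with this genotype can pass on.

    Genotypes are coded as the number of alternate alleles: 0, 1 or 2.
    """
    return {0: {0}, 1: {0, 1}, 2: {1}}[genotype]


def phase_child_site(father: int, mother: int, child: int) -> Optional[Tuple[int, int]]:
    """Return (paternal_allele, maternal_allele) for the child at one SNP.

    Returns None when the site is ambiguous (all three individuals heterozygous)
    and raises ValueError when the genotypes are not consistent with
    mendelian inheritance.
    """
    candidates = [
        (p, m)
        for p in sorted(transmissible(father))
        for m in sorted(transmissible(mother))
        if p + m == child
    ]
    if not candidates:
        raise ValueError("non-mendelian genotypes at this site")
    return candidates[0] if len(candidates) == 1 else None


def phase_child(trio_genotypes: List[Tuple[int, int, int]]):
    """Phase a child across SNPs given (father, mother, child) genotype triples.

    Unresolved sites are reported as None in both haplotypes.
    """
    paternal, maternal = [], []
    for father, mother, child in trio_genotypes:
        site = phase_child_site(father, mother, child)
        paternal.append(None if site is None else site[0])
        maternal.append(None if site is None else site[1])
    return paternal, maternal
```

For example, `phase_child([(0, 1, 1), (2, 1, 1), (1, 1, 1), (1, 1, 2)])` resolves every site except the third, where father, mother and child are all heterozygous; sites of that kind are the ones left to a statistical method.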
The project compared a variety of algorithms for phasing haplotypes from unrelated individuals and trios³⁸, and applied the algorithm that proved most accurate (an updated version of PHASE³⁹)

Table 2 | ENCODE project regions and genotyping

| Region name | Chromosome band | Genomic interval (NCBI) (base numbers)† | Gene density (%)‡ | Conservation score (%)§ | Pedigree-based recombination rate (cM Mb⁻¹)∥ | Population-based recombination rate (cM Mb⁻¹)¶ | G+C content# | Available SNPs: dbSNP☆ | Available SNPs: Sequence** | Available SNPs: Total | Successfully genotyped SNPs†† | Sequencing centre/genotyping centre(s)‡‡ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ENr112 | 2p16.3 | 51,633,239–52,133,238 | 0 | 3.8 | 0.8 | 0.9 | 0.35 | 1,570 | 1,762 | 3,332 | 2,275 | Broad/McGill-GQIC |
| ENr131 | 2q37.1 | 234,778,639–235,278,638 | 4.6 | 1.3 | 2.2 | 2.5 | 0.43 | 1,736 | 1,259 | 2,995 | 1,910 | Broad/McGill-GQIC |
| ENr113 | 4q26 | 118,705,475–119,205,474 | 0 | 3.9 | 0.6 | 0.9 | 0.35 | 1,444 | 2,053 | 3,497 | 2,201 | Broad/Broad |
| ENm010 | 7p15.2 | 26,699,793–27,199,792 | 5.0 | 22.0 | 0.9 | 0.9 | 0.44 | 1,220 | 1,795 | 3,015 | 1,271 | Baylor/UCSF-WU, Broad |
| ENm013* | 7q21.13 | 89,395,718–89,895,717 | 5.5 | 4.4 | 0.4 | 0.5 | 0.38 | 1,394 | 1,917 | 3,311 | 1,807 | Broad/Broad |
| ENm014* | 7q31.33 | 126,135,436–126,632,577 | 2.9 | 11.2 | 0.4 | 0.9 | 0.39 | 1,320 | 1,664 | 2,984 | 1,966 | Broad/Broad |
| ENr321 | 8q24.11 | 118,769,628–119,269,627 | 3.2 | 11.4 | 0.6 | 1.1 | 0.41 | 1,430 | 1,508 | 2,938 | 1,758 | Baylor/Illumina |
| ENr232 | 9q34.11 | 127,061,347–127,561,346 | 5.9 | 8.3 | 2.7 | 2.6 | 0.52 | 1,444 | 1,523 | 2,967 | 1,324 | Baylor/Illumina |
| ENr123 | 12q12 | 38,626,477–39,126,476 | 3.1 | 1.7 | 0.3 | 0.8 | 0.36 | 1,877 | 1,379 | 3,256 | 1,792 | Baylor/Baylor |
| ENr213 | 18q12.1 | 23,717,221–24,217,220 | 0.9 | 7.4 | 1.2 | 0.9 | 0.37 | 1,330 | 1,459 | 2,789 | 1,640 | Baylor/Illumina |
| Total | – | – | – | – | – | – | – | 14,765 | 16,319 | 31,084 | 17,944 | – |

McGill-GQIC, McGill University and Génome Québec Innovation Centre.
*These regions were truncated to 500 kb for resequencing.
†Sequence build 34 coordinates.
‡Gene density is defined as the percentage of bases covered either by Ensembl genes or human mRNA best BLAT alignments in the UCSC Genome Browser database.
§Non-exonic conservation with mouse sequence was measured by taking 125 base non-overlapping sub-windows inside the 500,000 base windows. Sub-windows with less than 75% of their bases in a mouse alignment were discarded. Of the remaining sub-windows, those with at least 80% base identity were used to calculate the conservation score. The mouse alignments in regions corresponding to the following were discarded: Ensembl genes, all GenBank mRNA Blastz alignments, FGenesh++ gene predictions, Twinscan gene predictions, spliced EST alignments, and repeats.
∥The pedigree-based sex-averaged recombination map is from deCODE Genetics⁴⁸.
¶Recombination rate based on estimates from LDhat⁴⁶.
#G+C content calculated from the sequence of the stated coordinates from sequence build 34.
☆SNPs in dbSNP build 121 at the time the ENCODE resequencing began and SNPs added to dbSNP in builds 122–125 independent of the resequencing.
**New SNPs discovered through the resequencing reported here (not found by other means in builds 122–125).
††SNPs successfully genotyped in all analysis panels (YRI, CEU, CHB+JPT).
‡‡Perlegen genotyped a subset of SNPs in the CEU samples.
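The windowed procedure in footnote § can be sketched as follows. This is an illustrative reading of the footnote, not the authors' code. It assumes that the alignment is supplied as two equal-length strings with one column per human base, that `-` in the mouse string marks an unaligned base, that identity is measured over aligned bases only, and that the score is the percentage of retained sub-windows reaching the identity threshold. The footnote does not spell out the last two points. Masking of genes, gene predictions, EST alignments and repeats is not modelled, and the function name and parameters are hypothetical.

```python
def conservation_score(human: str, mouse: str, window: int = 125,
                       min_aligned: float = 0.75,
                       min_identity: float = 0.80) -> float:
    """Percentage of retained sub-windows that reach the identity threshold.

    `human` and `mouse` are equal-length strings, one column per human base;
    '-' in `mouse` marks a human base with no mouse alignment (assumed coding).
    """
    if len(human) != len(mouse):
        raise ValueError("sequences must have the same length")

    retained = 0   # sub-windows with enough aligned bases
    conserved = 0  # retained sub-windows that pass the identity threshold

    # Step through non-overlapping sub-windows; a trailing partial window is ignored.
    for start in range(0, len(human) - window + 1, window):
        pairs = [
            (h, m)
            for h, m in zip(human[start:start + window], mouse[start:start + window])
            if m != "-"
        ]
        # Discard sub-windows with less than 75% of their bases aligned.
        if len(pairs) < min_aligned * window:
            continue
        retained += 1
        identity = sum(h == m for h, m in pairs) / len(pairs)
        if identity >= min_identity:
            conserved += 1

    return 100.0 * conserved / retained if retained else 0.0
```

With a toy window of 4 bases, `conservation_score("ACGTACGTACGT", "ACGTAC--TTTT", window=4)` retains the first and third sub-windows, discards the second (only half of its bases are aligned), and counts only the first as conserved.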