© 2005 Nature Publishing Group

A haplotype map of the human genome

The International HapMap Consortium*

Inherited genetic variation has a critical but as yet largely uncharacterized role in human disease. Here we report a public database of common variation in the human genome: more than one million single nucleotide polymorphisms (SNPs) for which accurate and complete genotypes have been obtained in 269 DNA samples from four populations, including ten 500-kilobase regions in which essentially all information about common DNA variation has been extracted. These data document the generality of recombination hotspots, a block-like structure of linkage disequilibrium and low haplotype diversity, leading to substantial correlations of SNPs with many of their neighbours. We show how the HapMap resource can guide the design and analysis of genetic association studies, shed light on structural variation and recombination, and identify loci that may have been subject to natural selection during human evolution.

Despite the ever-accelerating pace of biomedical research, the root causes of common human diseases remain largely unknown, preventative measures are generally inadequate, and available treatments are seldom curative. Family history is one of the strongest risk factors for nearly all diseases (including cardiovascular disease, cancer, diabetes, autoimmunity, psychiatric illnesses and many others), providing the tantalizing but elusive clue that inherited genetic variation has an important role in the pathogenesis of disease. Identifying the causal genes and variants would represent an important step in the path towards improved prevention, diagnosis and treatment of disease.

More than a thousand genes for rare, highly heritable ‘mendelian’ disorders have been identified, in which variation in a single gene is both necessary and sufficient to cause disease.
Common disorders, in contrast, have proven much more challenging to study, as they are thought to be due to the combined effect of many different susceptibility DNA variants interacting with environmental factors. Studies of common diseases have fallen into two broad categories: family-based linkage studies across the entire genome, and population-based association studies of individual candidate genes. Although there have been notable successes, progress has been slow due to the inherent limitations of the methods; linkage analysis has low power except when a single locus explains a substantial fraction of disease, and association studies of one or a few candidate genes examine only a small fraction of the ‘universe’ of sequence variation in each patient.

A comprehensive search for genetic influences on disease would involve examining all genetic differences in a large number of affected individuals and controls. It may eventually become possible to accomplish this by complete genome resequencing. In the meantime, it is increasingly practical to systematically test common genetic variants for their role in disease; such variants explain much of the genetic diversity in our species, a consequence of the historically small size and shared ancestry of the human population.

Recent experience bears out the hypothesis that common variants have an important role in disease, with a partial list of validated examples including HLA (autoimmunity and infection)1, APOE4 (Alzheimer’s disease, lipids)2, Factor V Leiden (deep vein thrombosis)3, PPARG (encoding PPARγ; type 2 diabetes)4,5, KCNJ11 (type 2 diabetes)6, PTPN22 (rheumatoid arthritis and type 1 diabetes)7,8, insulin (type 1 diabetes)9, CTLA4 (autoimmune thyroid disease, type 1 diabetes)10, NOD2 (inflammatory bowel disease)11,12, complement factor H (age-related macular degeneration)13–15 and RET (Hirschsprung disease)16,17, among many others.
Systematic studies of common genetic variants are facilitated by the fact that individuals who carry a particular SNP allele at one site often predictably carry specific alleles at other nearby variant sites. This correlation is known as linkage disequilibrium (LD); a particular combination of alleles along a chromosome is termed a haplotype.

LD exists because of the shared ancestry of contemporary chromosomes. When a new causal variant arises through mutation (whether a single nucleotide change, insertion/deletion, or structural alteration), it is initially tethered to a unique chromosome on which it occurred, marked by a distinct combination of genetic variants. Recombination and mutation subsequently act to erode this association, but do so slowly (each occurring at an average rate of about 10⁻⁸ per base pair (bp) per generation) as compared to the number of generations (typically 10⁴ to 10⁵) since the mutational event.

The correlations between causal mutations and the haplotypes on which they arose have long served as a tool for human genetic research: first finding association to a haplotype, and then subsequently identifying the causal mutation(s) that it carries. This was pioneered in studies of the HLA region, extended to identify causal genes for mendelian diseases (for example, cystic fibrosis18 and diastrophic dysplasia19), and most recently for complex disorders such as age-related macular degeneration13–15.

Early information documented the existence of LD in the human genome20,21; however, these studies were limited (for technical reasons) to a small number of regions with incomplete data, and general patterns were challenging to discern. With the sequencing of the human genome and development of high-throughput genomic methods, it became clear that the human genome generally displays more LD22 than under simple population genetic models23, and that LD is more varied across regions, and more segmentally structured24–30, than had previously been supposed.
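The definition of LD and the slow erosion of allelic association described above can be made concrete with a small sketch. The code below is an illustration only, not part of the HapMap analysis pipeline: the function names and the two-character haplotype encoding ("AB", "Ab", "aB", "ab") are hypothetical, and the decay function assumes a uniform recombination rate of 10⁻⁸ per bp per generation, ignoring hotspots, drift and selection. It computes the standard pairwise measures D, D′ and r², and the textbook expectation that D shrinks by a factor (1 − c) each generation, where c is the recombination fraction between the two sites.

```python
from collections import Counter


def ld_stats(haplotypes):
    """Return (D, D', r^2) for two biallelic sites.

    haplotypes: iterable of two-character strings, one per phased
    chromosome, using the illustrative encoding "AB", "Ab", "aB", "ab".
    """
    counts = Counter(haplotypes)
    n = sum(counts.values())
    p_ab = counts["AB"] / n                      # frequency of the A-B haplotype
    p_a = (counts["AB"] + counts["Ab"]) / n      # allele A at site 1
    p_b = (counts["AB"] + counts["aB"]) / n      # allele B at site 2
    d = p_ab - p_a * p_b                         # raw disequilibrium coefficient
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max if d_max > 0 else 0.0
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2


def expected_ld_fraction(distance_bp, generations, rate_per_bp=1e-8):
    """Expected fraction of the initial D remaining after `generations`.

    Assumes a uniform per-bp recombination rate (a simplification) and
    approximates the recombination fraction as rate * distance, capped at 0.5.
    """
    c = min(rate_per_bp * distance_bp, 0.5)
    return (1 - c) ** generations


if __name__ == "__main__":
    perfectly_linked = ["AB"] * 50 + ["ab"] * 50
    print(ld_stats(perfectly_linked))
    # Two sites 10 kb apart, 10,000 generations after the mutation arose.
    print(expected_ld_fraction(10_000, 10_000))
```

Under these assumptions, two sites 10 kb apart retain roughly a third of their initial disequilibrium after 10⁴ generations, which is consistent with the text's point that nearby variants stay correlated over the relevant timescale.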
*Lists of participants and affiliations appear at the end of the paper.
Vol 437 | 27 October 2005 | doi:10.1038/nature04226

These observations indicated that LD-based methods would generally have great value (because nearby SNPs were typically correlated with many of their neighbours), and also that LD relationships would
need to be empirically determined across the genome by studying polymorphisms at high density in population samples.

The International HapMap Project was launched in October 2002 to create a public, genome-wide database of common human sequence variation, providing information needed as a guide to genetic studies of clinical phenotypes31. The project had become practical by the confluence of the following: (1) the availability of the human genome sequence; (2) databases of common SNPs (subsequently enriched by this project) from which genotyping assays could be designed; (3) insights into human LD; (4) development of inexpensive, accurate technologies for high-throughput SNP genotyping; (5) web-based tools for storing and sharing data; and (6) frameworks to address associated ethical and cultural issues32. The project follows the data release principles of an international community resource project (http://www.wellcome.ac.uk/doc_WTD003208.html), sharing information rapidly and without restriction on its use.

The HapMap data were generated with the primary aim of guiding the design and analysis of medical genetic studies. In addition, the advent of genome-wide variation resources such as the HapMap opens a new era in population genetics, offering an unprecedented opportunity to investigate the evolutionary forces that have shaped variation in natural populations.

The Phase I HapMap

Phase I of the HapMap Project set as a goal genotyping at least one common SNP every 5 kilobases (kb) across the genome in each of 269 DNA samples. For the sake of practicality, and motivated by the allele frequency distribution of variants in the human genome, a minor allele frequency (MAF) of 0.05 or greater was targeted for study. (For simplicity, in this paper we will use the term ‘common’ to mean a SNP with MAF ≥ 0.05.) The project has a Phase II, which is attempting genotyping of an additional 4.6 million SNPs in each of the HapMap samples.
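The ‘common’ threshold defined above (MAF ≥ 0.05) is straightforward to apply once genotypes are coded numerically. The following is a minimal sketch, not the project's own QC code: it assumes each genotype is stored as the count of one arbitrarily chosen allele (0, 1 or 2) with `None` for a missing call, and it treats all individuals as unrelated, whereas allele frequencies in trio samples would normally be estimated from the parents only. The function names are hypothetical.

```python
def minor_allele_frequency(genotypes):
    """Return the minor allele frequency, or None if no genotype was called.

    genotypes: list of allele counts per individual (0, 1 or 2 copies of
    one allele); None marks a missing genotype and is skipped.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        return None
    # Each diploid individual contributes two chromosomes.
    allele_freq = sum(called) / (2 * len(called))
    return min(allele_freq, 1 - allele_freq)


def is_common(genotypes, threshold=0.05):
    """True if the SNP meets the paper's definition of 'common' (MAF >= 0.05)."""
    maf = minor_allele_frequency(genotypes)
    return maf is not None and maf >= threshold


if __name__ == "__main__":
    # 90 individuals, 10 heterozygotes: MAF = 10 / 180, about 0.056.
    print(is_common([0] * 80 + [1] * 10))
    # 90 individuals, 1 heterozygote: MAF = 1 / 180, about 0.006.
    print(is_common([0] * 89 + [1]))
```

Taking the minimum of the allele frequency and its complement means the result does not depend on which allele was chosen as the counted one.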
To compare the genome-wide resource to a more complete database of common variation (one in which all common SNPs and many rarer ones have been discovered and tested), a representative collection of ten regions, each 500 kb in length, was selected from the ENCODE (Encyclopedia of DNA Elements) Project33. Each 500-kb region was sequenced in 48 individuals, and all SNPs in these regions (discovered or in dbSNP) were genotyped in the complete set of 269 DNA samples.

The specific samples examined are: (1) 90 individuals (30 parent–offspring trios) from the Yoruba in Ibadan, Nigeria (abbreviation YRI); (2) 90 individuals (30 trios) in Utah, USA, from the Centre d’Etude du Polymorphisme Humain collection (abbreviation CEU); (3) 45 Han Chinese in Beijing, China (abbreviation CHB); (4) 44 Japanese in Tokyo, Japan (abbreviation JPT).

Because none of the samples was collected to be representative of a larger population such as ‘Yoruba’, ‘Northern and Western European’, ‘Han Chinese’, or ‘Japanese’ (let alone of all populations from ‘Africa’, ‘Europe’, or ‘Asia’), we recommend using a specific local identifier (for example, ‘Yoruba in Ibadan, Nigeria’) to describe the samples initially. Because the CHB and JPT allele frequencies are generally very similar, some analyses below combine these data sets. When doing so, we refer to three ‘analysis panels’ (YRI, CEU, CHB+JPT) to avoid confusing this analytical approach with the concept of a ‘population’.

Important details about the design of the HapMap Project are presented in the Methods, including: (1) organization of the project; (2) selection of DNA samples for study; (3) increasing the number and annotation of SNPs in the public SNP map (dbSNP) from 2.6 million to 9.2 million (Fig. 1); (4) targeted sequencing of the ten ENCODE regions, including evaluations of false-positive and false-negative rates; (5) genotyping for the genome-wide map; (6) intense efforts that monitored and established the high quality of the data; and (7) data coordination and distribution through the project Data Coordination Center (DCC) (http://www.hapmap.org).

Description of the data. The Phase I HapMap contains 1,007,329 SNPs that passed a set of quality control (QC) filters (see Methods) in each of the three analysis panels, and are polymorphic across the 269 samples. SNP genotyping was distributed across centres by chromosomal region, with several technologies employed (Table 1). Each centre followed the same standard rules for SNP selection, quality control and data release; all SNPs were genotyped in the full set of 269 samples. Some centres genotyped more SNPs than required by the rules.

Table 1 | Genotyping centres
Centre | Chromosomes | Technology
RIKEN | 5, 11, 14, 15, 16, 17, 19 | Third Wave Invader
Wellcome Trust Sanger Institute | 1, 6, 10, 13, 20 | Illumina BeadArray
McGill University and Génome Québec Innovation Centre | 2, 4p | Illumina BeadArray
Chinese HapMap Consortium* | 3, 8p, 21 | Sequenom MassExtend, Illumina BeadArray
Illumina | 8q, 9, 18q, 22, X | Illumina BeadArray
Broad Institute of Harvard and MIT | 4q, 7q, 18p, Y, mtDNA | Sequenom MassExtend, Illumina BeadArray
Baylor College of Medicine with ParAllele BioScience | 12 | ParAllele MIP
University of California, San Francisco, with Washington University in St Louis | 7p | PerkinElmer AcycloPrime-FP
Perlegen Sciences | 5 Mb (ENCODE) on 2, 4, 7, 8, 9, 12, 18 in CEU | High-density oligonucleotide array
*The Chinese HapMap Consortium consists of the Beijing Genomics Institute, the Chinese National Human Genome Center at Beijing, the University of Hong Kong, the Hong Kong University of Science and Technology, the Chinese University of Hong Kong, and the Chinese National Human Genome Center at Shanghai.

Figure 1 | Number of SNPs in dbSNP over time. The cumulative number of non-redundant SNPs (each mapped to a single location in the genome) is shown as a solid line, as well as the number of SNPs validated by genotyping (dotted line) and double-hit status (dashed line). Years are divided into quarters (Q1–Q4).

Extensive, blinded quality assessment (QA) exercises documented that these data are highly accurate (99.7%) and complete (99.3%, see
also Supplementary Table 1). All genotyping centres produced high-quality data (accuracy more than 99% in the blind QA exercises, Supplementary Tables 2 and 3), and missing data were not biased against heterozygotes. The Supplementary Information contains the full details of these efforts.

Although SNP selection was generally agnostic to functional annotation, 11,500 non-synonymous cSNPs (SNPs in coding regions of genes where the different SNP alleles code for different amino acids in the protein) were successfully typed in Phase I. (An effort was made to prioritize cSNPs in Phase I in choosing SNPs for each 5-kb region; all known non-synonymous cSNPs were attempted as part of Phase II.)

Across the ten ENCODE regions (Table 2), the density of SNPs was approximately tenfold higher as compared to the genome-wide map: 17,944 SNPs across the 5 megabases (Mb) (one per 279 bp).

More than 1.3 million SNP genotyping assays were attempted (Table 3) to generate the Phase I data on more than 1 million SNPs. The 0.3 million SNPs not part of the Phase I data set include 73,652 that passed QC filters but were monomorphic in all 269 samples. The remaining SNPs failed the QC filters in one or more analysis panels mostly because of inadequate completeness, non-mendelian inheritance, deviations from Hardy–Weinberg equilibrium, discrepant genotypes among duplicates, and data transmission discrepancies.

SNPs on the Phase I map are evenly spaced, except on Y and mtDNA. The Phase I data include a successful, common SNP every 5 kb across most of the genome in each analysis panel (Supplementary Fig. 1): only 3.3% of inter-SNP distances are longer than 10 kb, spanning 11.9% of the genome (Fig. 2; see also Supplementary Fig. 2). One exception is the X chromosome (Supplementary Fig. 1), where a much higher proportion of attempted SNPs were rare or monomorphic, and thus the density of common SNPs is lower.
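The spacing figures quoted above distinguish two quantities that are easy to conflate: the share of inter-SNP intervals that exceed 10 kb, and the share of the genome that lies inside those long intervals. A rough sketch of how both could be computed from a sorted list of SNP coordinates on one chromosome is given below. The function name and the toy coordinates are hypothetical, and the project's own definition of the ‘HapMappable’ genome (see Methods) is not reproduced here.

```python
def gap_summary(positions, threshold_bp=10_000):
    """Summarize inter-SNP distances for SNP positions on one chromosome.

    Returns the mean spacing, the fraction of gaps longer than
    `threshold_bp`, and the fraction of the covered span that falls
    inside those long gaps.
    """
    pos = sorted(positions)
    gaps = [b - a for a, b in zip(pos, pos[1:])]
    span = sum(gaps)
    if not gaps or span == 0:
        raise ValueError("need at least two distinct SNP positions")
    long_gaps = [g for g in gaps if g > threshold_bp]
    return {
        "mean_spacing_bp": span / len(gaps),
        "fraction_long_gaps": len(long_gaps) / len(gaps),
        "fraction_span_in_long_gaps": sum(long_gaps) / span,
    }


if __name__ == "__main__":
    # Toy coordinates: four 5-kb gaps and one 20-kb gap.
    print(gap_summary([0, 5_000, 10_000, 15_000, 20_000, 40_000]))
    # ENCODE density quoted in the text: 5 Mb / 17,944 SNPs.
    print(round(5_000_000 / 17_944))
```

In the toy input, one gap in five is long, yet it accounts for half of the covered span. The same effect explains why 3.3% of inter-SNP distances can span 11.9% of the genome.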
Two intentional exceptions to the regular spacing of SNPs on the physical map were the mitochondrial chromosome (mtDNA), which does not undergo recombination, and the non-recombining portion of chromosome Y. On the basis of the 168 successful, polymorphic SNPs, each HapMap sample fell into one of 15 (of the 18 known) mtDNA haplogroups34 (Table 4). A total of 84 SNPs that characterize the unique branches of the reference Y genealogical tree35–37 were genotyped on the HapMap samples. These SNPs assigned each Y chromosome to 8 (of the 18 major) Y haplogroups previously described (Table 4).

Highly accurate phasing of long-range chromosomal haplotypes. Despite having collected data in diploid individuals, the inclusion of parent–offspring trios and the use of computational methods made it possible to determine long-range phased haplotypes of extremely high quality for each individual. These computational algorithms take advantage of the observation that because of LD, relatively few of the large number of possible haplotypes consistent with the genotype data actually occur in population samples.
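To see why parent–offspring trios help with phasing, consider a single biallelic SNP. The sketch below is an illustration only and is not the statistical phasing method the project used: it applies plain Mendelian transmission rules to decide which allele the child received from each parent. Genotypes are assumed to be coded as counts of one allele (0, 1 or 2), and the function names are hypothetical. The only unresolved case is the one where all three individuals are heterozygous, which is where statistical methods that exploit LD are needed.

```python
def phase_child_site(child, father, mother):
    """Resolve the child's (paternal, maternal) alleles at one SNP.

    Genotypes are allele counts (0, 1 or 2). Returns a tuple of 0/1
    alleles, or None when the trio is uninformative (all heterozygous).
    Raises ValueError on a Mendelian inconsistency.
    """
    # Alleles each parent can transmit, given that parent's genotype.
    transmissible = {0: (0,), 1: (0, 1), 2: (1,)}
    options = [
        (p, m)
        for p in transmissible[father]
        for m in transmissible[mother]
        if p + m == child
    ]
    if not options:
        raise ValueError("genotypes are inconsistent with Mendelian inheritance")
    return options[0] if len(options) == 1 else None


def phase_child(child, father, mother):
    """Phase a child across sites; '?' marks sites the trio cannot resolve."""
    paternal, maternal = [], []
    for c, f, m in zip(child, father, mother):
        resolved = phase_child_site(c, f, m)
        paternal.append("?" if resolved is None else str(resolved[0]))
        maternal.append("?" if resolved is None else str(resolved[1]))
    return "".join(paternal), "".join(maternal)


if __name__ == "__main__":
    # Three sites; only the last (all heterozygous) stays ambiguous.
    print(phase_child(child=[1, 1, 1], father=[0, 2, 1], mother=[1, 1, 1]))
```

A site that fails the consistency check here corresponds to the ‘non-mendelian inheritance’ category that the QC filters described earlier are designed to catch.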
Table 2 | ENCODE project regions and genotyping
Region name | Chromosome band | Genomic interval (NCBI) (base numbers)† | Gene density (%)‡ | Conservation score (%)§ | Pedigree-based recombination rate (cM Mb⁻¹)|| | Population-based recombination rate (cM Mb⁻¹)¶ | G+C content# | Available SNPs: dbSNP☆ | Available SNPs: Sequence** | Available SNPs: Total | Successfully genotyped SNPs†† | Sequencing centre/genotyping centre(s)‡‡
ENr112 | 2p16.3 | 51,633,239–52,133,238 | 0 | 3.8 | 0.8 | 0.9 | 0.35 | 1,570 | 1,762 | 3,332 | 2,275 | Broad/McGill-GQIC
ENr131 | 2q37.1 | 234,778,639–235,278,638 | 4.6 | 1.3 | 2.2 | 2.5 | 0.43 | 1,736 | 1,259 | 2,995 | 1,910 | Broad/McGill-GQIC
ENr113 | 4q26 | 118,705,475–119,205,474 | 0 | 3.9 | 0.6 | 0.9 | 0.35 | 1,444 | 2,053 | 3,497 | 2,201 | Broad/Broad
ENm010 | 7p15.2 | 26,699,793–27,199,792 | 5.0 | 22.0 | 0.9 | 0.9 | 0.44 | 1,220 | 1,795 | 3,015 | 1,271 | Baylor/UCSF-WU, Broad
ENm013* | 7q21.13 | 89,395,718–89,895,717 | 5.5 | 4.4 | 0.4 | 0.5 | 0.38 | 1,394 | 1,917 | 3,311 | 1,807 | Broad/Broad
ENm014* | 7q31.33 | 126,135,436–126,632,577 | 2.9 | 11.2 | 0.4 | 0.9 | 0.39 | 1,320 | 1,664 | 2,984 | 1,966 | Broad/Broad
ENr321 | 8q24.11 | 118,769,628–119,269,627 | 3.2 | 11.4 | 0.6 | 1.1 | 0.41 | 1,430 | 1,508 | 2,938 | 1,758 | Baylor/Illumina
ENr232 | 9q34.11 | 127,061,347–127,561,346 | 5.9 | 8.3 | 2.7 | 2.6 | 0.52 | 1,444 | 1,523 | 2,967 | 1,324 | Baylor/Illumina
ENr123 | 12q12 | 38,626,477–39,126,476 | 3.1 | 1.7 | 0.3 | 0.8 | 0.36 | 1,877 | 1,379 | 3,256 | 1,792 | Baylor/Baylor
ENr213 | 18q12.1 | 23,717,221–24,217,220 | 0.9 | 7.4 | 1.2 | 0.9 | 0.37 | 1,330 | 1,459 | 2,789 | 1,640 | Baylor/Illumina
Total | – | – | – | – | – | – | – | 14,765 | 16,319 | 31,084 | 17,944 | –
McGill-GQIC, McGill University and Génome Québec Innovation Centre.
*These regions were truncated to 500 kb for resequencing.
†Sequence build 34 coordinates.
‡Gene density is defined as the percentage of bases covered either by Ensembl genes or human mRNA best BLAT alignments in the UCSC Genome Browser database.
§Non-exonic conservation with mouse sequence was measured by taking 125 base non-overlapping sub-windows inside the 500,000 base windows. Sub-windows with less than 75% of their bases in a mouse alignment were discarded. Of the remaining sub-windows, those with at least 80% base identity were used to calculate the conservation score. The mouse alignments in regions corresponding to the following were discarded: Ensembl genes, all GenBank mRNA Blastz alignments, FGenesh++ gene predictions, Twinscan gene predictions, spliced EST alignments, and repeats.
||The pedigree-based sex-averaged recombination map is from deCODE Genetics48.
¶Recombination rate based on estimates from LDhat46.
#G+C content calculated from the sequence of the stated coordinates from sequence build 34.
☆SNPs in dbSNP build 121 at the time the ENCODE resequencing began and SNPs added to dbSNP in builds 122–125 independent of the resequencing.
**New SNPs discovered through the resequencing reported here (not found by other means in builds 122–125).
††SNPs successfully genotyped in all analysis panels (YRI, CEU, CHB+JPT).
‡‡Perlegen genotyped a subset of SNPs in the CEU samples.

The project compared a variety of algorithms for phasing haplotypes from unrelated individuals and trios38, and applied the algorithm that proved most accurate (an updated version of PHASE39)
separately to each analysis panel. (Phased haplotypes are available for download at the Project website.) We estimate that 'switch' errors (where a segment of the maternal haplotype is incorrectly joined to the paternal) occur extraordinarily rarely in the trio samples (every 8 Mb in CEU; 3.6 Mb in YRI). The switch rate is higher in the CHB+JPT samples (one per 0.34 Mb) due to the lack of information from parent–offspring trios, but even for the unrelated individuals, statistical reconstruction of haplotypes is remarkably accurate.

Estimating properties of SNP discovery and dbSNP. Extensive sequencing and genotyping in the ENCODE regions characterized the false-positive and false-negative rates for dbSNP, as well as polymerase chain reaction (PCR)-based resequencing (see Methods). These data reveal two important conclusions: first, that PCR-based sequencing of diploid samples may be biased against very rare variants (that is, those seen only as a single heterozygote), and second, that the vast majority of common variants are either represented in dbSNP, or show tight correlation to other SNPs that are in dbSNP (Fig. 3).

Figure 2 | Distribution of inter-SNP distances. The distributions are shown for each analysis panel for the HapMappable genome (defined in the Methods), for all common SNPs (with MAF ≥ 0.05).

Table 3 | HapMap Phase I genotyping success measures

SNP categories | YRI | CEU | CHB+JPT
Assays submitted | 1,273,716 | 1,302,849 | 1,273,703
Passed QC filters | 1,123,296 (88%) | 1,157,650 (89%) | 1,134,726 (89%)
Did not pass QC filters* | 150,420 (12%) | 145,199 (11%) | 138,977 (11%)
  > 20% missing data | 98,116 (65%) | 107,626 (74%) | 93,710 (67%)
  > 1 duplicate inconsistent | 7,575 (5%) | 6,254 (4%) | 10,725 (8%)
  > 1 mendelian error | 22,815 (15%) | 13,600 (9%) | 0 (0%)
  < 0.001 Hardy–Weinberg P-value | 12,052 (8%) | 9,721 (7%) | 16,176 (12%)
  Other failures† | 23,478 (16%) | 17,692 (12%) | 23,722 (17%)
Non-redundant (unique) SNPs | 1,076,392 | 1,104,980 | 1,087,305
Monomorphic | 156,290 (15%) | 234,482 (21%) | 268,325 (25%)
Polymorphic | 920,102 (85%) | 870,498 (79%) | 818,980 (75%)

All analysis panels
Unique QC-passed SNPs | 1,156,772
Passed in one analysis panel | 52,204 (5%)
Passed in two analysis panels | 97,231 (8%)
Passed in three analysis panels | 1,007,337 (87%)
Monomorphic across three analysis panels | 75,997
Polymorphic in all three analysis panels | 682,397
MAF ≥ 0.05 in at least one of three analysis panels | 877,351

*Out of 95 samples in CEU, YRI; 94 samples in CHB+JPT.
†'Other failures' includes SNPs with discrepancies during the data transmission process. Some SNPs failed in more than one way, so these percentages add up to more than 100%.

Allele frequency distributions within population samples. The underlying allele frequency distributions for these samples are best
estimated from the ENCODE data, where deep sequencing reduces bias due to SNP ascertainment. Consistent with previous studies, most SNPs observed in the ENCODE regions are rare: 46% had MAF < 0.05, and 9% were seen in only a single individual (Fig. 4). Although most varying sites in the population are rare, most heterozygous sites within any individual are due to common SNPs. Specifically, in the ENCODE data, 90% of heterozygous sites in each individual were due to common variants (Fig. 4). With ever-deeper sequencing of DNA samples the number of rare variants will rise linearly, but the vast majority of heterozygous sites in each person will be explained by a limited set of common SNPs now contained (or captured through LD) in existing databases (Fig. 3).

Consistent with previous descriptions, the CEU, CHB and JPT samples show fewer low-frequency alleles when compared to the YRI samples (Fig. 5), a pattern thought to be due to bottlenecks in the history of the non-YRI populations.

In contrast to the ENCODE data, the distribution of allele frequencies for the genome-wide data is flat (Fig. 5), with much more similarity in the distributions observed in the three analysis panels. These patterns are well explained by the inherent and intentional bias in the rules used for SNP selection: we prioritized using validated SNPs in order to focus resources on common (rather than rare or false-positive) candidate SNPs from the public databases. For a fuller discussion of ascertainment issues, including a shift in frequencies over time and an excess of high-frequency derived alleles due to inclusion of chimpanzee data in determination of double-hit status, see the Supplementary Information (Supplementary Fig. 3).

SNP allele frequencies across population samples.
Of the 1.007 million SNPs successfully genotyped and polymorphic across the three analysis panels, only a subset were polymorphic in any given panel: 85% in YRI, 79% in CEU, and 75% in CHB+JPT. The joint distribution of frequencies across populations is presented in Fig. 6 (for the ENCODE data) and Supplementary Fig. 4 (for the genome-wide map). We note the similarity of allele frequencies in the CHB and JPT samples, which motivates analysing them jointly as a single analysis panel in the remainder of this report.

A simple measure of population differentiation is Wright's FST, which measures the fraction of total genetic variation due to between-population differences⁴⁰. Across the autosomes, FST estimated from the full set of Phase I data is 0.12, with CEU and CHB+JPT showing the lowest level of differentiation (FST = 0.07), and YRI and CHB+JPT the highest (FST = 0.12). These values are slightly higher than previous reports⁴¹, but differences in the types of variants (SNPs versus microsatellites) and the samples studied make comparisons difficult.

As expected, we observed very few fixed differences (that is, cases in which alternate alleles are seen exclusively in different analysis panels). Across the 1 million SNPs genotyped, only 11 have fixed differences between CEU and YRI, 21 between CEU and CHB+JPT, and 5 between YRI and CHB+JPT, for the autosomes.

The extent of differentiation is similar across the autosomes, but higher on the X chromosome (FST = 0.21). Interestingly, 123 SNPs on the X chromosome were completely differentiated between YRI and CHB+JPT, but only two between CEU and YRI and one between CEU and CHB+JPT.
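As an illustration of FST as 'the fraction of total genetic variation due to between-population differences', the sketch below evaluates a simple textbook form, FST = (H_T - H_S) / H_T, for a single biallelic SNP in two equally weighted populations. The choice of estimator, the equal weighting and the function name are assumptions made here for illustration; the estimator actually used for the Phase I data is described in the cited reference and may differ.

```python
def fst_two_populations(p1, p2):
    """F_ST for one biallelic SNP from allele frequencies in two populations.

    Uses the heterozygosity form F_ST = (H_T - H_S) / H_T with equal
    population weights (an assumption of this sketch).
    """
    p_bar = (p1 + p2) / 2
    # Expected heterozygosity if the two populations were pooled.
    h_total = 2 * p_bar * (1 - p_bar)
    if h_total == 0:
        # The SNP is monomorphic overall, so there is no variation to partition.
        return float("nan")
    # Mean expected heterozygosity within each population.
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    return (h_total - h_within) / h_total


if __name__ == "__main__":
    print(fst_two_populations(0.3, 0.3))  # identical frequencies: no differentiation
    print(fst_two_populations(0.2, 0.6))  # moderate frequency difference
    print(fst_two_populations(0.0, 1.0))  # a fixed difference between populations
```

Under this definition a fixed difference (alternate alleles seen exclusively in different populations) gives a value of 1, and identical allele frequencies give 0.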
Table 4 | mtDNA and Y chromosome haplogroups

MtDNA haplogroup | YRI (60)* | CEU (60)* | CHB (45)* | JPT (44)*
L1 | 0.22 | – | – | –
L2 | 0.35 | – | – | –
L3 | 0.43 | – | – | –
A | – | – | 0.13 | 0.04
B | – | – | 0.33 | 0.30
C | – | – | 0.09 | 0.07
D | – | – | 0.22 | 0.34
M/E | – | – | 0.22 | 0.25
H | – | 0.45 | – | –
V | – | 0.07 | – | –
J | – | 0.08 | – | –
T | – | 0.12 | – | –
K | – | 0.03 | – | –
U | – | 0.23 | – | –
W | – | 0.02 | – | –

Y chromosome haplogroup | YRI (30)* | CEU (30)* | CHB (22)* | JPT (22)*
E1 | 0.07 | – | – | –
E3a | 0.93 | – | – | –
F, H, K | – | 0.03 | 0.23 | 0.14
I | – | 0.27 | – | –
R1 | – | 0.70 | – | –
C | – | – | 0.09 | 0.09
D | – | – | – | 0.45
NO | – | – | 0.68 | 0.32

*Number of chromosomes sampled is given in parentheses.

Figure 3 | Allele frequency and completeness of dbSNP for the ENCODE regions. a–c, The fraction of SNPs in dbSNP, or with a proxy in dbSNP, are shown as a function of minor allele frequency for each analysis panel (a, YRI; b, CEU; c, CHB+JPT). Singletons refer to heterozygotes observed in a single individual, and are broken out from other SNPs with MAF < 0.05. Because all ENCODE SNPs have been deposited in dbSNP, for this figure we define a SNP as 'in dbSNP' if it would be in dbSNP build 125 independent of the HapMap ENCODE resequencing project. All remaining SNPs (not in dbSNP) were discovered only by ENCODE resequencing; they are categorized by their correlation (r²) to those in dbSNP. Note that the number of SNPs in each frequency bin differs among analysis panels, because not all SNPs are polymorphic in all analysis panels.

This seems to be largely due to a single region near the centromere, possibly indicating a history of natural
selection at this locus (see below; M. L. Freedman et al., personal communication).

Haplotype sharing across populations. We next examined the extent to which haplotypes are shared across populations. We used a hidden Markov model in which each haplotype is modelled in turn as an imperfect mosaic of other haplotypes (see Supplementary Information)⁴². In essence, the method infers probabilistically which other haplotype in the sample is the closest relative (nearest neighbour) at each position along the chromosome.

Unsurprisingly, the nearest neighbour most often is from the same analysis panel, but about 10% of haplotypes were found most closely to match a haplotype in another panel (Supplementary Fig. 5). All individuals have at least some segments over which the nearest neighbour is in a different analysis panel. These results indicate that although analysis panels are characterized both by different haplotype frequencies and, to some extent, different combinations of alleles, both common and rare haplotypes are often shared across populations.

Properties of LD in the human genome

Traditionally, descriptions of LD have focused on measures calculated between pairs of SNPs, averaged as a function of physical distance. Examples of such analyses for the HapMap data are presented in Supplementary Fig. 6. After adjusting for known confounders such as sample size, allele frequency distribution, marker density, and length of sampled regions, these data are highly similar to previously published surveys⁴³.

Because LD varies markedly on scales of 1–100 kb, and is often discontinuous rather than declining smoothly with distance, averages obscure important aspects of LD structure. A fuller exploration of the fine-scale structure of LD offers both insight into the causes of LD and understanding of its application to disease research.

LD patterns are simple in the absence of recombination.
The most natural path to understanding LD structure is first to consider the simplest case in which there is no recombination (or gene conversion), and then to add recombination to the model. (For simplicity we ignore genotyping error and recurrent mutation in this discussion, both of which seem to be rare in these data.)

In the absence of recombination, diversity arises solely through mutation. Because each SNP arose on a particular branch of the genealogical tree relating the chromosomes in the current populations, multiple haplotypes are observed. SNPs that arose on the same branch of the genealogy are perfectly correlated in the sample, whereas SNPs that occurred on different branches have imperfect correlations, or no correlation at all.

We illustrate these concepts using empirical genotype data from 36 adjacent SNPs in an ENCODE region (ENr131.2q37), selected because no obligate recombination events were detectable among them in CEU (Fig. 7). (We note that the lack of obligate recombination events in a small sample does not guarantee that no recombinants have occurred, but it provides a good approximation for illustration.)

In principle, 36 such SNPs could give rise to 2³⁶ different haplotypes. Even with no recombination, gene conversion or recurrent mutation, up to 37 different haplotypes could be formed. Despite this great potential diversity, only seven haplotypes are observed (five seen more than once) among the 120 parental CEU chromosomes studied, reflecting shared ancestry since their most recent common ancestor among apparently unrelated individuals.

In such a setting, it is easy to interpret the two most common pairwise measures of LD: D′ and r². (See the Supplementary Information for fuller definitions of these measures.) D′ is defined to be 1 in the absence of obligate recombination, declining only due to recombination or recurrent mutation²⁷.
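The two bounds quoted above can be checked with a short sketch. Without recombination or recurrent mutation, each of n SNPs adds at most one new haplotype to the ancestral one, giving at most n + 1 haplotypes, whereas free recombination allows up to 2^n. The sketch also applies the four-gamete test to every pair of SNPs as one common way of asking whether a set of haplotypes requires any recombination; the function names and the toy haplotypes are assumptions introduced here, and this is not presented as the procedure used to select the region in Fig. 7.

```python
from itertools import combinations


def max_haplotypes(n_snps, recombination=True):
    """Upper bound on the number of distinct haplotypes over n biallelic SNPs."""
    return 2 ** n_snps if recombination else n_snps + 1


def has_four_gametes(haplotypes, i, j):
    """True if all four allele combinations occur at SNPs i and j."""
    return len({(h[i], h[j]) for h in haplotypes}) == 4


def no_obligate_recombination(haplotypes):
    """True if no pair of SNPs shows all four gametes.

    Assumes biallelic SNPs and no recurrent mutation (assumptions of this sketch).
    """
    n_snps = len(haplotypes[0])
    return not any(
        has_four_gametes(haplotypes, i, j)
        for i, j in combinations(range(n_snps), 2)
    )


if __name__ == "__main__":
    print(max_haplotypes(36))                       # 2 ** 36
    print(max_haplotypes(36, recombination=False))  # 36 + 1
    # Toy data: three SNPs whose mutations arose one after another on a tree.
    tree_like = ["000", "100", "110", "111"]
    print(no_obligate_recombination(tree_like))
    # Adding a haplotype that mixes lineages forces at least one recombination.
    print(no_obligate_recombination(tree_like + ["011"]))
```

In the toy example the four tree-like haplotypes reach the n + 1 bound for three SNPs; the extra haplotype creates all four gametes at the first two SNPs and so fails the test.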
Figure 4 | Minor allele frequency distribution of SNPs in the ENCODE data, and their contribution to heterozygosity. This figure shows the polymorphic SNPs from the HapMap ENCODE regions according to minor allele frequency (blue), with the lowest minor allele frequency bin (<0.05) separated into singletons (SNPs heterozygous in one individual only, shown in grey) and SNPs with more than one heterozygous individual. For this analysis, MAF is averaged across the analysis panels. The sum of the contribution of each MAF bin to the overall heterozygosity of the ENCODE regions is also shown (orange).

Figure 5 | Allele frequency distributions for autosomal SNPs. For each analysis panel we plotted (bars) the MAF distribution of all the Phase I SNPs with a frequency greater than zero. The solid line shows the MAF distribution for the ENCODE SNPs, and the dashed line shows the MAF distribution expected for the standard neutral population model with constant population size and random mating without ascertainment bias.

In contrast, r² is simply
the squared correlation coefficient between the two SNPs. Thus, r² is 1 when two SNPs arose on the same branch of the genealogy and remain undisrupted by recombination, but has a value less than 1 when SNPs arose on different branches, or if an initially strong correlation has been disrupted by crossing over.

In this region, D′ = 1 for all marker pairs, as there is no evidence of historical recombination. In contrast, and despite great simplicity of haplotype structure, r² values display a complex pattern, varying from 0.0003 to 1.0, with no relationship to physical distance. This makes sense, however, because without recombination, correlations among SNPs depend on the historical order in which they arose, not the physical order of SNPs on the chromosome.

Most importantly, the seeming complexity of r² values can be deconvolved in a simple manner: only seven different SNP configurations exist in this region, with all but two chromosomes matching five common haplotypes, which can be distinguished from each other by typing a specific set of four SNPs. That is, only a small minority of sites need be examined to capture fully the information in this region.

Variation in local recombination rates is a major determinant of LD. Recombination in the ancestors of the current population has typically disrupted the simple picture presented above. In the human genome, as in yeast⁴⁴, mouse⁴⁵ and other genomes, recombination rates typically vary dramatically on a fine scale, with hotspots of recombination explaining much crossing over in each region²⁸. The generality of this model has recently been demonstrated through computational methods that allow estimation of recombination rates (including hotspots and coldspots) from genotype data⁴⁶,⁴⁷.

The availability of nearly complete information about common DNA variation in the ENCODE regions allowed a more precise estimation of recombination rates across large regions than in any previous study.
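The contrast drawn above between D′ and r² in a region without recombination can be reproduced on toy data. The sketch below uses the standard definitions D = p_AB - p_A p_B, D′ = |D| / D_max and r² = D² / (p_A (1 - p_A) p_B (1 - p_B)); the function name, the 0/1 allele coding and the example haplotype counts are assumptions introduced here, and D′ is reported as an absolute value.

```python
def pairwise_ld(haplotypes, i, j):
    """Return (D', r^2) for SNPs i and j from phased haplotypes.

    haplotypes: list of tuples of 0/1 alleles (hypothetical coding for this sketch).
    """
    n = len(haplotypes)
    p_a = sum(h[i] for h in haplotypes) / n
    p_b = sum(h[j] for h in haplotypes) / n
    p_ab = sum(1 for h in haplotypes if h[i] == 1 and h[j] == 1) / n
    if p_a in (0.0, 1.0) or p_b in (0.0, 1.0):
        raise ValueError("both SNPs must be polymorphic in the sample")
    d = p_ab - p_a * p_b
    # D_max is the largest |D| attainable given the two allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d_prime, r2


if __name__ == "__main__":
    # Two SNPs on the same branch of the genealogy: only two haplotypes exist.
    same_branch = [(0, 0)] * 6 + [(1, 1)] * 4
    # Two SNPs on different (nested) branches: three of the four gametes exist.
    nested = [(0, 0)] * 4 + [(1, 0)] * 4 + [(1, 1)] * 2
    print(pairwise_ld(same_branch, 0, 1))
    print(pairwise_ld(nested, 0, 1))
```

In both toy cases fewer than four gametes are present, so D′ is 1; r² is 1 only for the pair on the same branch and falls well below 1 for the nested pair, which mirrors the pattern described for the ENr131.2q37 region.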
We estimated recombination rates and identified recombination hotspots in the ENCODE data, using methods previously described⁴⁶ (see Supplementary Information for details). Hotspots are short regions (typically spanning about 2 kb) over which recombination rates rise dramatically over local background rates.

Whereas the average recombination rate over 500 kb across the human genome is about 0.5 cM⁴⁸, the estimated recombination rate across the 500-kb ENCODE regions varied nearly tenfold, from a minimum of 0.19 cM (ENm013.7q21.13) to a maximum of 1.25 cM (ENr232.9q34.11). Even this tenfold variation obscures much more dramatic variation over a finer scale: 88 hotspots of recombination were identified (Fig. 8; see also Supplementary Fig. 7), that is, one per 57 kb, with hotspots detected in each of the ten regions (from 4 in 12q12 to 14 in 2q37.1). Across the 5 Mb, we estimate that about 80% of all recombination has taken place in about 15% of the sequence (Fig. 9, see also refs 46, 49).

A block-like structure of human LD. With most human recombination occurring in recombination hotspots, the breakdown of LD is often discontinuous. A 'block-like' structure of LD is visually apparent in Fig. 8 and Supplementary Fig. 7: segments of consistently high D′ that break down where high recombination rates, recombination hotspots and obligate recombination events⁵⁰ all cluster.

Figure 6 | Comparison of allele frequencies in the ENCODE data for all pairs of analysis panels and between the CHB and JPT sample sets. For each polymorphic SNP we identified the minor allele across all panels (a–d) and then calculated the frequency of this allele in each analysis panel/sample set. The colour in each bin represents the number of SNPs that display each given set of allele frequencies. The purple regions show that very few SNPs are common in one panel but rare in another. The red regions show that there are many SNPs that have similar low frequencies in each pair of analysis panels/sample sets.

When haplotype blocks are more formally defined in the ENCODE data (using a method based on a composite of local D′
values³⁰, or another based on the four gamete test⁵¹), most of the sequence falls into long segments of strong LD that contain many SNPs and yet display limited haplotype diversity (Table 5). Specifically, addressing concerns that blocks might be an artefact of low marker density⁵², in these nearly complete data most of the sequence falls into blocks of four or more SNPs (67% in YRI to 87% in CEU) and the average sizes of such blocks are similar to initial estimates³⁰. Although the average block spans many SNPs (30–70), the average number of common haplotypes in each block ranged only from 4.0 (CHB+JPT) to 5.6 (YRI), with nearly all haplotypes in each block matching one of these few common haplotypes. These results confirm the generality of inferences drawn from disease-mapping studies²⁷ and genomic surveys with smaller sample sizes²⁹ and less complete data³⁰.

Long-range haplotypes and local patterns of recombination. Although haplotypes often break at recombination hotspots (and block boundaries), this tendency is not invariant. We identified all unique haplotypes with frequency more than 0.05 across the 269 individuals in the phased data, and compared them to the fine-scale recombination map. Figure 10 shows a region of chromosome 19 over which many such haplotypes break at identified recombination hotspots, but others continue. Thus, the tendency towards colocalization of recombination sites does not imply that all haplotypes break at each recombination site.

Some regions display remarkably extended haplotype structure based on a lack of recombination (Supplementary Fig. 8a, b). Most striking, if unsurprising, are centromeric regions, which lack recombination: haplotypes defined by more than 100 SNPs span several megabases across the centromeres. The X chromosome has multiple regions with very extensive haplotypes, whereas other chromosomes typically have a few such domains.
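The four gamete test named above can be illustrated with a short sketch. This is a simplified greedy variant written for illustration, not necessarily the block-definition procedure of ref. 51: it assumes phased haplotypes coded as strings of '0' and '1' (one character per biallelic SNP), and the function names and toy data are hypothetical. A pair of SNPs showing all four allele combinations implies at least one historical recombination (or recurrent mutation) between them, so a block is closed as soon as a new SNP forms such a pair with any SNP already in the block.

```python
def has_four_gametes(haps, i, j):
    """Return True if all four allele combinations occur at SNPs i and j."""
    return len({(h[i], h[j]) for h in haps}) == 4


def four_gamete_blocks(haps):
    """Greedily partition SNP indices into blocks with no four-gamete pair.

    Returns a list of (first_index, last_index) tuples, inclusive.
    This is an illustrative heuristic, not the published method.
    """
    n_snps = len(haps[0])
    blocks = []
    start = 0
    for k in range(1, n_snps):
        # Close the current block if SNP k shows all four gametes
        # with any SNP already in the block.
        if any(has_four_gametes(haps, i, k) for i in range(start, k)):
            blocks.append((start, k - 1))
            start = k
    blocks.append((start, n_snps - 1))
    return blocks


# Toy data (invented): SNPs 0-2 are mutually compatible, SNPs 3-4 likewise,
# but SNP 3 shows all four gametes with SNP 0.
haps = ["00000", "00011", "11000", "11011", "11100"]
blocks = four_gamete_blocks(haps)
```

On the toy data the sketch places SNPs 0 to 2 in one block and SNPs 3 and 4 in another; real analyses would additionally need to handle missing data and rare haplotypes.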
Table 5 | Haplotype blocks in ENCODE regions, according to two methods

Parameter                                                          YRI     CEU     CHB+JPT
Method based on a composite of local D′ values³⁰
  Average number of SNPs per block                                 30.3    70.1    54.4
  Average length per block (kb)                                    7.3     16.3    13.2
  Fraction of genome spanned by blocks (%)                         67      87      81
  Average number of haplotypes (MAF ≥ 0.05) per block              5.57    4.66    4.01
  Fraction of chromosomes due to haplotypes with MAF ≥ 0.05 (%)    94      93      95
Method based on the four gamete test⁵¹
  Average number of SNPs per block                                 19.9    24.3    24.3
  Average length per block (kb)                                    4.8     5.9     5.9
  Fraction of genome spanned by blocks (%)                         86      84      84
  Average number of haplotypes (MAF ≥ 0.05) per block              5.12    3.63    3.63
  Fraction of chromosomes due to haplotypes with MAF ≥ 0.05 (%)    91      95      95

Figure 7 | Genealogical relationships among haplotypes and r² values in a region without obligate recombination events. The region of chromosome 2 (234,876,004–234,884,481 bp; NCBI build 34) within ENr131.2q37 contains 36 SNPs, with zero obligate recombination events in the CEU samples. The left part of the plot shows the seven different haplotypes observed over this region (alleles are indicated only at SNPs), with their respective counts in the data. Underneath each of these haplotypes is a binary representation of the same data, with coloured circles at SNP positions where a haplotype has the less common allele at that site. Groups of SNPs all captured by a single tag SNP (with r² ≥ 0.8) using a pairwise tagging algorithm⁵³,⁵⁴ have the same colour. Seven tag SNPs corresponding to the seven different colours capture all the SNPs in this region. On the right these SNPs are mapped to the genealogical tree relating the seven haplotypes for the data in this region.

Most global measures of LD become more consistent when measured in genetic rather than physical distance. For example, when plotted against physical distance, the extent of pairwise LD
varies by chromosome; when plotted against average recombination rate on each chromosome (estimated from pedigree-based genetic maps) these differences largely disappear (Supplementary Fig. 6). Similarly, the distribution of haplotype length across chromosomes is less variable when measured in genetic rather than physical distance. For example, the median length of haplotypes is 54.4 kb on chromosome 1 compared to 34.8 kb on chromosome 21. When measured in genetic distance, however, haplotype length is much more similar: 0.104 cM on chromosome 1 compared to 0.111 cM on chromosome 21 (Supplementary Fig. 9).

The exception is again the X chromosome, which has more extensive haplotype structure after accounting for recombination rate (median haplotype length = 0.135 cM). Multiple factors could explain different patterns on the X chromosome: lower SNP density, smaller sample size, restriction of recombination to females and lower effective population size.

A view of LD focused on the putative causal SNP

Although genealogy and recombination provide insight into why nearby SNPs are often correlated, it is the redundancies among SNPs that are of central importance for the design and analysis of association studies. A truly comprehensive genetic association study must consider all putative causal alleles and test each for its potential role in disease. If a causal variant is not directly tested in the disease sample, its effect can nonetheless be indirectly tested if it is correlated with a SNP or haplotype that has been directly tested.

Figure 8 | Comparison of linkage disequilibrium and recombination for two ENCODE regions. For each region (ENr131.2q37.1 and ENm014.7q31.33), D′ plots for the YRI, CEU and CHB+JPT analysis panels are shown: white, D′ < 1 and LOD < 2; blue, D′ = 1 and LOD < 2; pink, D′ < 1 and LOD ≥ 2; red, D′ = 1 and LOD ≥ 2. Below each of these plots is shown the intervals where distinct obligate recombination events must have occurred (blue and green indicate adjacent intervals). Stacked intervals represent regions where there are multiple recombination events in the sample history. The bottom plot shows estimated recombination rates, with hotspots shown as red triangles⁴⁶.
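As a rough illustration of the two pairwise LD measures used in this section (D′ in the plots of Fig. 8 and r² for correlation between SNPs), the sketch below computes both from phased haplotypes using their standard textbook definitions. It is not the consortium's analysis code: the haplotype coding (strings of '0' and '1'), the function name and the toy data are assumptions made for this example, and the LOD score used in Fig. 8 is not computed.

```python
def pairwise_ld(haps, i, j):
    """Return (|D'|, r^2) for biallelic SNPs i and j from phased haplotypes.

    haps is a list of equal-length strings over '0'/'1' (assumed coding).
    Both SNPs must be polymorphic in the sample.
    """
    n = len(haps)
    p_a = sum(h[i] == "1" for h in haps) / n           # allele frequency at SNP i
    p_b = sum(h[j] == "1" for h in haps) / n           # allele frequency at SNP j
    p_ab = sum(h[i] == "1" and h[j] == "1" for h in haps) / n

    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic")

    d = p_ab - p_a * p_b                               # raw disequilibrium D
    # Largest |D| attainable given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))

    return abs(d) / d_max, d * d / denom


# Toy data (invented): only three of the four gametes are present,
# so |D'| is 1 even though the SNPs are only partially correlated.
d_prime, r2 = pairwise_ld(["00", "00", "01", "11"], 0, 1)
```

In the toy example |D′| equals 1 while r² is one third, which shows why a pair of SNPs can sit inside a region of high D′ and still be an imperfect proxy for each other.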
The typical SNP is highly correlated with many of its neighbours. The ENCODE data reveal that SNPs are typically perfectly correlated to several nearby SNPs, and partially correlated to many others.

We use the term proxy to mean a SNP that shows a strong correlation with one or more others. When two variants are perfectly correlated, testing one is exactly equivalent to testing the other; we refer to such collections of SNPs (with pairwise r² = 1.0 in the HapMap samples) as ‘perfect proxy sets’. Considering only common SNPs (the target of study for the HapMap Project) in CEU in the ENCODE data, one in five SNPs has 20 or more perfect proxies, and three in five have five or more. In contrast, one in five has no perfect proxies. As expected, perfect proxy sets are smaller in YRI, with twice as many SNPs (two in five) having no perfect proxy, and a quarter as many (5%) having 20 or more (Figs 11 and 12). These patterns are largely consistent across the range of frequencies studied by the project, with a trend towards fewer proxies at MAF < 0.10 (Fig. 11). Put another way, the average common SNP in ENCODE is perfectly redundant with three other SNPs in the YRI samples, and nine to ten other SNPs in the other sample sets (Fig. 13).

Of course, to be detected through LD in an association study, correlation need not be complete between the genotyped SNP and the causal variant. For example, under a multiplicative disease model and a single-locus χ² test, the sample size required to detect association to an allele scales as 1/r². That is, if the causal SNP has an r² = 0.5 to one tested in the disease study, full power can be maintained if the sample size is doubled.

The number of SNPs showing such substantial but incomplete correlation is much larger. For example, using a looser threshold for declaring correlation (r² ≥ 0.5), the average number of proxies found for a common SNP in CHB+JPT is 43, and the average in YRI is 16 (Fig. 12).
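The proxy counts and the 1/r² sample-size scaling described above can be sketched as follows. This is a toy illustration under stated assumptions, not the pipeline used for Figs 11 to 13: haplotypes are assumed to be phased strings of '0' and '1', every SNP is assumed polymorphic, and the function names and data are hypothetical. Exact fractions are used so that threshold comparisons such as r² = 1.0 are not affected by floating-point rounding.

```python
from fractions import Fraction


def r_squared(haps, i, j):
    """Exact r^2 between biallelic SNPs i and j from phased haplotypes."""
    n = len(haps)
    n_a = sum(h[i] == "1" for h in haps)
    n_b = sum(h[j] == "1" for h in haps)
    n_ab = sum(h[i] == "1" and h[j] == "1" for h in haps)
    den = n_a * (n - n_a) * n_b * (n - n_b)
    if den == 0:
        raise ValueError("both SNPs must be polymorphic")
    # r^2 = D^2 / (pA(1-pA)pB(1-pB)), written in terms of counts.
    return Fraction((n * n_ab - n_a * n_b) ** 2, den)


def count_proxies(haps, threshold=1.0):
    """For each SNP, count the other SNPs with r^2 >= threshold."""
    n_snps = len(haps[0])
    return [
        sum(1 for j in range(n_snps)
            if j != i and r_squared(haps, i, j) >= threshold)
        for i in range(n_snps)
    ]


def required_sample_size(n_direct, r2):
    """Sample size needed when testing a proxy, using the 1/r^2 scaling.

    n_direct is the size that would suffice if the causal SNP were typed.
    """
    if not 0 < r2 <= 1:
        raise ValueError("r2 must be in (0, 1]")
    return n_direct / r2


# Toy data (invented): SNPs 0 and 1 are perfect proxies; SNP 2 has r^2 = 0.5
# with each of them.
haps = ["000", "000", "111", "111", "110", "000"]
perfect = count_proxies(haps, threshold=1.0)
loose = count_proxies(haps, threshold=0.5)
n_needed = required_sample_size(1000, 0.5)
```

Relaxing the threshold from 1.0 to 0.5 increases the number of proxies found for every SNP in the toy data, at the cost of a doubled sample size for a proxy with r² = 0.5.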
These partial correlations can be exploited through haplotype analysis to increase power to detect putative causal alleles, as discussed below.

Figure 9 | The distribution of recombination events over the ENCODE regions. Proportion of sequence containing a given fraction of all recombination for the ten ENCODE regions (coloured lines) and combined (black line). For each line, SNP intervals are placed in decreasing order of estimated recombination rate⁴⁶, combined across analysis panels, and the cumulative recombination fraction is plotted against the cumulative proportion of sequence. If recombination rates were constant, each line would lie exactly along the diagonal, and so lines further to the right reveal the fraction of regions where recombination is more strongly locally concentrated.

Figure 10 | The relationship among recombination rates, haplotype lengths and gene locations. Recombination rates in cM Mb⁻¹ (blue). Non-redundant haplotypes with frequency of at least 5% in the combined sample (bars) and genes (black segments) are shown in an example gene-dense region of chromosome 19 (19q13). Haplotypes are coloured by the number of detectable recombination events they span, with red indicating many events and blue few.

Evaluating performance of the Phase I map. To estimate the proportion of all common SNPs captured by the Phase I map, we