BMC Genomics

Research article

Survey of microsatellite clustering in eight fully sequenced species sheds light on the origin of compound microsatellites

Robert Kofler*1, Christian Schlötterer2, Evita Luschützky3 and Tamas Lelley1

Address: 1 University of Natural Resources and Applied Life Sciences, Department for Agrobiotechnology IFA-Tulln, Institute of Biotechnology in Plant Production, Konrad Lorenz Straße 20, 3430 Tulln, Austria, 2 Institut für Populationsgenetik, Veterinärmedizinische Universität Wien, Josef Baumann Gasse 1, 1210 Wien, Austria and 3 Umweltbundesamt, Spittelauer Lände 5, 1090 Wien, Austria

E-mail: Robert Kofler* - robert@kofler.or.at; Christian Schlötterer - christian.schloetterer@vu-wien.ac.at; Evita Luschützky - Evita.Luschuetzky@umweltbundesamt.at; Tamas Lelley - tamas.lelley@boku.ac.at
* Corresponding author

Received: 7 May 2008
Accepted: 17 December 2008
Published: 17 December 2008

BMC Genomics 2008, 9:612 doi: 10.1186/1471-2164-9-612

This article is available from: http://www.biomedcentral.com/1471-2164/9/612

© 2008 Kofler et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background: Compound microsatellites are a special variation of microsatellites in which two or more individual microsatellites are found directly adjacent to each other. Until now, such composite microsatellites have not been investigated in a comprehensive manner.

Results: Our in silico survey of microsatellite clustering in the genomes of Homo sapiens, Macaca mulatta, Mus musculus, Rattus norvegicus, Ornithorhynchus anatinus, Gallus gallus, Danio rerio and Drosophila melanogaster revealed an unexpectedly high abundance of compound microsatellites.
About 4–25% of all microsatellites could be categorized as compound microsatellites. Compound microsatellites are approximately 15 times more frequent than expected under the assumption of a random distribution of microsatellites. Interestingly, microsatellites not only tend to cluster, but the adjacent repeat types of compound microsatellites also have very similar motifs: in most cases (>90%) these motifs differ by only a single mutation (base substitution or indel). We propose that the majority of compound microsatellites originate by duplication of imperfections in a microsatellite tract. This process occurs mostly at the end of a microsatellite, leading to a new repeat type and a potential microsatellite repeat tract.

Conclusion: Our findings suggest a more dynamic picture of microsatellite evolution than previously believed. Imperfections within microsatellites might not only cause the "death" of microsatellites; they might also result in their "birth".

1 Background

Microsatellites or simple sequence repeats (SSR) are DNA stretches consisting of a tandemly repeated short DNA motif (≤ 6 bp). Due to the special mutation mechanism of microsatellites, termed "DNA replication slippage", these sequences often exhibit length hypervariability with respect to the number of motifs being repeated [reviews: [1-3]]. Owing to this hypervariability and their ubiquitous presence in genomes, microsatellites have attracted much attention during the last decade, notably resulting in various genetic marker systems [4-6].

According to Chambers et al. [7] the following categories of microsatellites can be distinguished: Pure, Interrupted pure, Compound, Interrupted compound, Complex and Interrupted complex. In this survey we mainly refer to Compound and Interrupted compound microsatellites. This has to be distinguished from the term microsatellite
cluster as used by Grover and Sharma [8], which refers to microsatellite-rich regions.

Although microsatellites were first described more than twenty years ago [9], their evolution is still not fully understood [2, 3]. In particular, imperfections within microsatellites have been the subject of much debate. Imperfections in the microsatellite tract are thought to interfere with replication slippage and thereby limit microsatellite size expansion [10-12]. If they accumulate in a microsatellite tract, they have even been proposed to cause the "death" of a microsatellite [13]. The complementary concept, the "birth" of a microsatellite, was first introduced by Messier [14].

Compound microsatellites, i.e. two or more microsatellites found in close proximity, have been frequently reported in diverse taxa ranging from humans to plants [10, 15-19]. Weber [10] estimated that about 10% of the human microsatellites have a composite motif. Despite their abundance, compound microsatellites have not yet been studied in a comprehensive manner, and very little is known about their origin and evolutionary dynamics. This lack of knowledge is partly due to the difficulties involved in identifying them with computer-aided approaches. The analysis of compound microsatellites is additionally confounded by the fact that two microsatellites can be arranged in several different combinations [16, 20]. For instance, the two microsatellites [AC]n and [AG]m can be found in four different arrangements: the [AG]m microsatellite might be located 5' or 3' of the [AC]n microsatellite, and either the poly-TC or the poly-AG tract of the [AG]m microsatellite might be found on the same DNA strand as the poly-AC tract of the [AC]n microsatellite. For these reasons, four different motif standardizations were introduced by Kofler et al. [20] [see also Additional file 1].
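The strand and rotation ambiguity described above can be made concrete with a small sketch. It is not the authors' implementation: the function names are hypothetical, and the two rules shown (smallest cyclic rotation on one strand; smallest rotation over both strands) are only an assumed reading of the terms "partially" and "fully standardized" used later in the text. The four standardizations of Kofler et al. [20] themselves are defined in Additional file 1.

```python
# Hypothetical helper names; the rules are an assumed reading of
# "partial" and "full" motif standardization, not the published definition.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def rotations(motif: str) -> list[str]:
    """All cyclic rotations of a motif, e.g. 'AAG' -> ['AAG', 'AGA', 'GAA']."""
    return [motif[i:] + motif[:i] for i in range(len(motif))]


def reverse_complement(motif: str) -> str:
    """Motif as read on the complementary strand, in 5'->3' direction."""
    return motif.translate(COMPLEMENT)[::-1]


def partial_standardization(motif: str) -> str:
    """Alphabetically smallest rotation on the given strand only (assumption)."""
    return min(rotations(motif))


def full_standardization(motif: str) -> str:
    """Alphabetically smallest rotation over both strands (assumption)."""
    return min(rotations(motif) + rotations(reverse_complement(motif)))


if __name__ == "__main__":
    for m in ("TC", "CT", "GA", "AG", "ATT"):
        print(m, partial_standardization(m), full_standardization(m))
```

Under these assumed rules, TC, CT, GA and AG all receive the fully standardized motif AG, whereas partial standardization keeps the strand information (TC and CT become CT, GA and AG become AG). This is why the relative strand of two adjacent repeats has to be recorded separately.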
Here we provide the first comprehensive survey of compound microsatellites in the fully sequenced genomes of eight eukaryotic species. We surveyed the entire genomes as well as the coding sequence (cds), the 5' and the 3' untranslated regions (5'-UTR and 3'-UTR) separately. We analyzed the genomes of five mammals (Homo sapiens, Macaca mulatta, Mus musculus, Rattus norvegicus, Ornithorhynchus anatinus), a bird (Gallus gallus), a fish (Danio rerio) and an insect (Drosophila melanogaster). We show that 4–25% of all microsatellites are part of compound microsatellites and discuss the possible evolutionary mechanisms leading to the observed high frequency of compound microsatellites.

2 Results

2.1 Distance between microsatellites

We define a compound microsatellite as an aggregation of at least two microsatellites with different motifs [partially standardized: see Additional file 1]. All identified microsatellites have a minimum length of 15 bp (see Material and Methods). Whether two or more adjacent microsatellites count as a compound microsatellite depends on the distance separating them. In this work, microsatellites separated by less than a maximum threshold dmax were classified as a compound microsatellite. For brevity, we termed individual microsatellites that are part of such a compound microsatellite cSSR, and the percentage of these microsatellites cSSR-%. We determined the impact of dmax by measuring the proportion of microsatellites that could be classified as part of compound microsatellites (cSSR-%) for a given dmax (Fig. 1). As expected, the number of compound microsatellites increases with dmax, but the increase is not linear. While we observed species-specific differences, the overall pattern is that an inflection point occurs around a dmax of 50 bp, indicating a different behavior (Fig. 1). One difference between cds and whole genome is that for cds an upper boundary for the distance between two microsatellites exists, i.e.
the total length of the cds.

2.2 Frequency of compound microsatellites

We quantified the compound microsatellite density in the different genomes by setting dmax to 10 bp. Rodents and D. rerio had the highest proportion of microsatellites classified as compound microsatellites (Table 1), whereas D. melanogaster and O. anatinus had the lowest. Interestingly, for coding sequences no major differences were observed between the species (Table 1). Only R. norvegicus contained an exceptionally high cSSR-% in the cds (Table 1). In D. melanogaster this proportion was higher for coding sequences than for genomic sequences, indicating a more pronounced clustering in the cds than in non-coding sequences (Table 1). The impact of different SSR-search settings on the frequency of compound microsatellites can be found in Additional file 2 (Table S2).

2.3 Distribution of compound microsatellites within the genome of H. sapiens

The distribution of microsatellites is not homogeneous within genomes. For example, in H. sapiens and M. musculus an increase in microsatellite density toward the ends of the chromosomes was reported (in [2]). We therefore investigated the distribution of compound microsatellites along the chromosomes. The SSR and the compound microsatellite densities were calculated with an overlapping sliding window approach using a window size of 5 Mbp and a step size of 1 Mbp. Consistent with previous results, we show that the distribution of microsatellites varies along the chromosomes as well
as between chromosomes of H. sapiens (Fig. 2). Generally, the distribution of compound microsatellites follows the distribution of microsatellites very closely. Nevertheless, some chromosome-specific patterns could be detected. While for most chromosomes the peaks in compound microsatellite density follow the microsatellite density, on chromosome 15 only a relatively weak correspondence could be seen. On some chromosomes the compound microsatellite pattern also seems to be more pronounced than the microsatellite pattern (e.g. chromosome 8). Finally, the spacing between the lines indicating the microsatellite and compound microsatellite density differs among the chromosomes of H. sapiens, suggesting that the relative frequency of compound microsatellites differs among chromosomes (Fig. 2).

Figure 1: Influence of dmax on the cSSR-%. Two panels (whole genome and cds); x-axis: dmax (0 to 400); y-axis: cSSR percentage [%] (0 to 60); one curve per species (H. sapiens, M. mulatta, M. musculus, R. norvegicus, O. anatinus, G. gallus, D. rerio, D. melanogaster).

Figure 2: Compound microsatellite density in the chromosomes of H. sapiens compared to the microsatellite density. One panel per chromosome (Chr. 1 to Chr. 22, Chr. X and Chr. Y). Regions which have not yet been sequenced (poly-N tracts) are designated yellow. The scale of the compound microsatellite density [#/Mbp] is on the left-hand side and the scale of the SSR density [#/Mbp] on the right-hand side. The SSR and the compound microsatellite density were calculated with a sliding window approach using a window size of 5 Mbp and a step size of 1 Mbp.

Table 1: Frequency of compound microsatellites in the whole genome (WG) and in the coding sequence (cds).

| species | WG m. (1) | WG c. (2) | WG cSSR (3) | WG % (4) | WG m.d. (5) | WG c.d. (6) | cds m. (1) | cds c. (2) | cds cSSR (3) | cds % (4) | cds m.d. (5) | cds c.d. (6) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| H. sap. | 1 169 530 | 59 792 | 129 848 | 11.1 | 413.0 | 21.1 | 4 965 | 104 | 233 | 4.7 | 77.4 | 1.6 |
| M. mul. | 1 178 381 | 61 407 | 134 455 | 11.4 | 445.3 | 23.2 | 3 638 | 64 | 139 | 3.8 | 71.3 | 1.3 |
| M. mus. | 1 574 180 | 173 535 | 398 361 | 25.3 | 617.9 | 68.1 | 3 995 | 95 | 202 | 5.1 | 72.5 | 1.7 |
| R. nor. | 1 307 474 | 133 120 | 291 304 | 22.3 | 527.8 | 53.7 | 1 883 | 92 | 226 | 12.0 | 92.6 | 4.5 |
| O. anat. | 133 984 | 1 913 | 3 969 | 3.0 | 327.2 | 4.7 | 1 535 | 16 | 34 | 2.2 | 42.8 | 0.5 |
| G. gal. | 233 896 | 8 532 | 17 989 | 7.7 | 237.5 | 8.7 | 1 889 | 36 | 77 | 4.1 | 58.3 | 1.1 |
| D. rerio | 1 048 258 | 94 159 | 225 069 | 21.5 | 688.1 | 61.8 | 3 215 | 86 | 180 | 5.6 | 72.0 | 1.9 |
| D. mel. | 44 600 | 714 | 1 457 | 3.3 | 376.9 | 6.0 | 4 168 | 105 | 213 | 5.1 | 145.6 | 3.7 |

(1) total number of microsatellites in DNA sequence space
(2) total number of compound microsatellites in DNA sequence space
(3) number of individual microsatellites being part of a compound microsatellite
(4) percentage of individual microsatellites being part of a compound microsatellite (cSSR-%)
(5) microsatellite density [m./Mbp]
(6) compound microsatellite density [c./Mbp]
H. sap.: Homo sapiens; M. mul.: Macaca mulatta; M. mus.: Mus musculus; R. nor.: Rattus norvegicus; O. anat.: Ornithorhynchus anatinus; G. gal.: Gallus gallus; D. rerio: Danio rerio; D. mel.: Drosophila melanogaster

2.4 Parameters governing compound microsatellite density

Differences in compound microsatellite density can be caused by the parameters 'SSR density', 'species', 'chromosome' and 'recombination'. We tested which of these parameters have a significant influence on compound microsatellite density. Due to the scarcity of species with sequenced Y-chromosomes, only H. sapiens, Pan troglodytes and M. musculus were used for this analysis. We observed that the parameters 'SSR-density' (CatReg: p < 0.001), 'species' (CatReg: p < 0.001) and 'chromosome'
(CatReg: p < 0.001) have a highly significant influence on the compound microsatellite density. These three parameters are highly correlated with the compound microsatellite density (CatReg: R² = 0.94). Additionally, the relative contributions (rc) of these parameters to the regression could be identified. We found that 'species' (rc = 0.36) and 'chromosome' (rc = 0.38) have the strongest influence and that 'SSR density' has a moderate influence (rc = 0.26). Because compound microsatellites are a subset of the total microsatellite repertoire, we modified our analysis and correlated the density of microsatellites that could not be classified as compound microsatellites with the density of compound microsatellites. Again, 'species' (CatReg: p < 0.001), 'chromosome' (CatReg: p < 0.001) and 'SSR density' (CatReg: p < 0.001) have a significant influence on compound microsatellite density and are highly correlated (CatReg: R² = 0.93) with it. To determine the influence of recombination, we compared two groups of chromosomes with extreme differences in recombination rate (Y-chromosomes versus all other chromosomes) and found no significant influence (CatReg: p = 0.214). To further test the influence of recombination, we used the human recombination map published by Kong et al. [21], compared the recombination frequencies with the compound microsatellite density, and found only a very weak correlation (Linear regression: R² = 0.03) [see Additional file 3 and Additional file 4].

2.5 Compound microsatellite complexity

Compound microsatellites might contain different numbers of individual microsatellites (cSSRs). For example, the compound microsatellite [AC]9 [AG]10 contains two cSSRs, whereas the compound microsatellite [AC]11 [AG]7 [AC]9 contains three. We call the former a 'di-SSR' and the latter a 'tri-SSR' compound microsatellite. Most compound microsatellites (≈ 87%) contain only two cSSRs (Table 2).
The number of identified compound microsatellites decreases rapidly with increasing complexity. However, very large compound microsatellites, containing more than eight cSSRs, can be found in many species (Table 2). We found the largest compound microsatellite, with 40 cSSRs, on chromosome 17 of D. rerio. With only a few exceptions, compound microsatellites in the cds contain no more than four cSSRs (Table 2). The complexity of compound microsatellites in the 5'-UTRs and 3'-UTRs is higher, but rarely exceeds three cSSRs [see Additional file 2: Table S7]. To test whether compound microsatellites originate from a nesting of microsatellites, i.e. secondary microsatellites emerging in the tract of primary microsatellites, we analyzed the percentage of tri-SSR compound microsatellites having the pattern [m1]n1 [m2]n2 [m1]n3, where m1 and m2 are the motifs of the individual cSSRs [partially standardized: see Additional file 1]. In all eight species about 33% of the tri-SSR compound microsatellites exhibit this pattern [see Additional file 2: Table S11], which suggests that most (67%) tri-SSR compound microsatellites do not originate by a nesting of microsatellites.

2.6 Aggregation of microsatellites

To test whether the occurrence of compound microsatellites can be attributed to mere chance, we determined whether microsatellites tend to aggregate with respect to an assumed random distribution of microsatellites in the genome. For simplicity we confine this analysis to pairs of adjacent microsatellites and introduce the technical concept of SSR-couples. An SSR-couple is a pair of adjacent microsatellites separated by less than 10 bp (dmax); it may be part of a more complex compound microsatellite. For example, a tri-SSR compound microsatellite could be viewed as two overlapping SSR-couples. SSR-couples containing two microsatellites with an identical motif were not considered [partially standardized: see Additional file 1].
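The dmax-based definitions used in Sections 2.1 to 2.6 can be illustrated with a short sketch. It assumes a list of already detected microsatellites on one sequence, each with start and end coordinates and a standardized motif. The `SSR` record, the function names and the toy coordinates are hypothetical, and the chaining rule (gap strictly below dmax, adjacent motifs different) is one reading of the definitions in the text rather than the original software.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class SSR:
    start: int   # 0-based start on the sequence (hypothetical convention)
    end: int     # end, exclusive
    motif: str   # standardized motif, e.g. "AC"


def compound_microsatellites(ssrs, dmax=10):
    """Chain neighbouring SSRs whose gap is < dmax and whose motifs differ.

    Returns a list of groups; each group has at least two SSRs (cSSRs).
    """
    groups, current = [], []
    for ssr in sorted(ssrs, key=lambda s: s.start):
        joins = (
            bool(current)
            and ssr.start - current[-1].end < dmax
            and ssr.motif != current[-1].motif
        )
        if joins:
            current.append(ssr)
        else:
            if len(current) >= 2:
                groups.append(current)
            current = [ssr]
    if len(current) >= 2:
        groups.append(current)
    return groups


def cssr_percent(ssrs, groups):
    """Percentage of individual SSRs that are part of a compound SSR."""
    return 100.0 * sum(len(g) for g in groups) / len(ssrs)


def ssr_couples(groups):
    """Adjacent pairs inside each group; a tri-SSR yields two couples."""
    return [(g[i], g[i + 1]) for g in groups for i in range(len(g) - 1)]


def is_nested_tri(group):
    """True for the tri-SSR pattern [m1][m2][m1]."""
    return len(group) == 3 and group[0].motif == group[2].motif


if __name__ == "__main__":
    toy = [
        SSR(100, 118, "AC"), SSR(118, 138, "AG"),   # di-SSR
        SSR(500, 520, "AT"),                        # isolated
        SSR(900, 922, "AC"), SSR(925, 941, "AG"), SSR(944, 962, "AC"),  # tri-SSR
        SSR(2000, 2016, "AAAG"),                    # isolated
    ]
    groups = compound_microsatellites(toy, dmax=10)
    print(Counter(len(g) for g in groups))          # complexity distribution
    print(round(cssr_percent(toy, groups), 1))      # cSSR-%
    print(len(ssr_couples(groups)), [is_nested_tri(g) for g in groups])
```

Re-running `compound_microsatellites` over a range of `dmax` values and plotting `cssr_percent` would give a curve of the kind shown in Fig. 1.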
Table 2: Compound microsatellite complexity in the whole genome (WG) and in the cds.

| species | WG 2 | WG 3 | WG 4 | WG 5 | WG 6 | WG 7 | WG 8 | WG ≥ 9 | cds 2 | cds 3 | cds 4 | cds ≥ 5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| H. sap. | 51 997 | 6 096 | 1 198 | 335 | 106 | 41 | 7 | 12 | 81 | 21 | 2 | 0 |
| M. mul. | 52 796 | 6 565 | 1 389 | 433 | 155 | 49 | 10 | 10 | 53 | 11 | 0 | 0 |
| M. mus. | 137 237 | 26 551 | 6 561 | 2 080 | 652 | 241 | 99 | 114 | 84 | 10 | 1 | 0 |
| R. nor. | 113 077 | 16 505 | 2 632 | 607 | 170 | 78 | 19 | 32 | 72 | 11 | 5 | 4 |
| O. anat. | 1 791 | 105 | 13 | 4 | 0 | 0 | 0 | 0 | 14 | 2 | 0 | 0 |
| G. gal. | 7 782 | 610 | 115 | 17 | 6 | 2 | 0 | 0 | 32 | 3 | 1 | 0 |
| D. rerio | 71 280 | 15 703 | 4 163 | 1 641 | 592 | 336 | 143 | 301 | 78 | 8 | 0 | 0 |
| D. mel. | 685 | 29 | 0 | 0 | 0 | 0 | 0 | 0 | 102 | 3 | 0 | 0 |

Column headings give the compound microsatellite complexity (c.c.), i.e. the number of individual microsatellites constituting the compound microsatellite. All values are counts.
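The comparison of observed and expected SSR-couple counts in Section 2.6 relies on a Poisson distribution. As a rough sketch of such a test (an assumption about its form, not the authors' procedure), the upper-tail probability P(X ≥ observed) for X ~ Poisson(expected) can be evaluated in log space, which avoids floating-point underflow for the very small p-values involved. The function name is hypothetical. The example counts are the H. sapiens whole-genome values from Table 3; since the expected count there is rounded, published p-values need not be reproduced exactly.

```python
import math


def log10_poisson_upper_tail(observed: int, expected: float) -> float:
    """Return log10 of P(X >= observed) for X ~ Poisson(expected)."""
    if expected <= 0:
        raise ValueError("expected must be positive")
    if observed <= 0:
        return 0.0  # P(X >= 0) = 1
    if observed <= expected:
        # The tail is not small here, so the plain complement is accurate.
        lower = sum(
            math.exp(-expected + k * math.log(expected) - math.lgamma(k + 1))
            for k in range(observed)
        )
        return math.log10(1.0 - lower)
    # log of the first tail term P(X = observed); lgamma(n + 1) = log(n!)
    log_first = (-expected + observed * math.log(expected)
                 - math.lgamma(observed + 1))
    # Remaining terms relative to the first one: each is the previous
    # term times expected / k, which is < 1 because observed > expected.
    total, term, k = 1.0, 1.0, observed
    while term > 1e-17 * total:
        k += 1
        term *= expected / k
        total += term
    return (log_first + math.log(total)) / math.log(10.0)


if __name__ == "__main__":
    # H. sapiens, whole genome (Table 3): 69 670 observed, 4 488 expected
    log_p = log10_poisson_upper_tail(69670, 4488.0)
    print(log_p < -99)  # True: far below the 1E-99 reporting limit
```

Working with log10(p) rather than p itself is a design choice of this sketch: a probability this small cannot be represented as an ordinary double-precision number and would otherwise be reported as exactly 0.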
Table 3 shows that SSR-couples are significantly overrepresented in the whole genome (Poisson Distribution: p ≈ 0) as well as in the cds (Poisson Distribution: p < 10⁻²²) of the eight species. Although less abundant than in the entire genome, SSR-couples are also significantly overrepresented in the 5'-UTR and 3'-UTR [see Additional file 2: Table S8]. Since we observed regional variation in microsatellite and compound microsatellite densities in all chromosomes (Fig. 2) [see Additional file 5], we conducted this analysis in all eight species separately for each sliding window (size 5 Mbp). We found that the number of observed SSR-couples deviates significantly from the expected number in each sliding window (Poisson Distribution: p < 10⁻⁴) [see Additional file 5]. Therefore, our results do not support the hypothesis of a random distribution of microsatellites. Interestingly, the overrepresentation of SSR-couples in the cds is consistently more than twofold higher than in the whole genome (Table 3), whereas it is lowest in the 5'-UTR and 3'-UTR [see Additional file 2: Table S8].

2.7 Motifs of compound microsatellites

To answer whether there is any motif preference in the composition of compound microsatellites, we examined which microsatellites are most frequently found in close proximity, e.g. whether the microsatellite [AC]n is more frequently associated with the microsatellite [AG]n than with any other microsatellite motif. For simplicity, we again confined this analysis to SSR-couples. We define SSR-couples having the form [m1]n [m2]n as SSR-couples of motif m1–m2; e.g. the SSR-couple [AT]12 [AC]9 has the motif AT-AC [fully standardized: see Additional file 1]. Additionally, we examined the conformation of the SSR-couples. Each microsatellite consists of two tracts; for example, an [AC]n microsatellite consists of a poly-AC tract and a poly-TG tract on the complementary strand.
The SSR-couple [AC]8 [AG]9 can be found in two conformations: the poly-AC tract of the [AC]8 microsatellite may be found either on the same DNA strand as the poly-AG tract of the [AG]9 microsatellite or on the complementary strand. We call the former the plus-conformation and the latter the minus-conformation [see Additional file 1].

Table 3: Overrepresentation of SSR-couples in the whole genome (WG) and in the cds.

| species | WG obs. (1) | WG exp. (2) | WG or. (3) | WG P (4) | cds obs. (1) | cds exp. (2) | cds or. (3) | cds P (4) |
|---|---|---|---|---|---|---|---|---|
| H. sap. | 69 670 | 4 488 | 15 | < 1E-99 | 129 | 4 | 36 | < 1E-99 |
| M. mul. | 72 780 | 4 800 | 15 | < 1E-99 | 74 | 2 | 30 | 3E-82 |
| M. mus. | 223 973 | 9 526 | 23 | < 1E-99 | 107 | 3 | 40 | < 1E-99 |
| R. nor. | 157 300 | 6 639 | 23 | < 1E-99 | 134 | 2 | 81 | < 1E-99 |
| O. anat. | 2 052 | 399 | 5 | < 1E-99 | 18 | 1 | 28 | 6E-22 |
| G. gal. | 9 435 | 512 | 18 | < 1E-99 | 41 | 1 | 40 | 9E-52 |
| D. rerio | 130 012 | 7 026 | 18 | < 1E-99 | 93 | 2 | 42 | < 1E-99 |
| D. mel. | 743 | 164 | 4 | < 1E-99 | 108 | 4 | 24 | < 1E-99 |

(1) observed number of SSR-couples
(2) expected number of SSR-couples with respect to a random distribution of microsatellites within DNA sequence space
(3) overrepresentation (obs./exp.)
(4) significance of the overrepresentation based on a Poisson Distribution; values listed as 0 in the original table (p < 1E-99) are given as < 1E-99

Table 4 shows the characteristics of the most abundant SSR-couple motifs in the whole genome of the eight species and Table 5 shows equivalent information for the cds [for the 5'-UTR and 3'-UTR see Additional file 2: Table S9 and Table S10]. These tables also contain the conformation and the proposed genesis of each SSR-couple. In the whole genome of all eight species the most abundant SSR-couple motifs are AT-AC, AC-AG and AAAG-AAGG (Table 4). Different SSR-couple motifs are overrepresented to different degrees (Table 4). The SSR-couple motif AAGG-AGGG, for instance, is 1000 times more abundant than expected by chance. In contrast, SSR-couples containing an [A]n microsatellite are usually only about 40-fold overrepresented. A few SSR-couples have an overrepresentation of ≈ 1, which suggests that they have emerged by chance. Most SSR-couples, however, are mainly found in only one of the two possible conformations (Table 4), i.e. they are conformation specific. For example, SSR-couples with the motif AG-AAAG are always in the plus-conformation (Table 4). Conformation specificity of SSR-couple motifs suggests that these SSR-couples have not arisen by chance. Only SSR-couples having the motif AC-AG are frequently found in both conformations (Table 4). SSR-couples containing two microsatellites with complementary motifs, such as [CTG]13-[CAG]67, have been proposed to arise from recombination between
homologous microsatellites [22]. Only [AAT]n-[ATT]n (motif: AAT-AAT) SSR-couples in D. rerio and O. anatinus have such complementary motifs (Table 4). Instead, most SSR-couples contain two microsatellites with very similar motifs (Table 4), which differ by a single mutation (base substitution or indel) in more than 90% of cases. Hence, only a single mutation would be required to transform one motif into the other.

Table 4: Characteristics and probable genesis of the most abundant SSR-couples in the whole genome

H. sapiens                                     M. mulatta
motif        obs.¹    or.²   %plus³  gen.⁴     motif        obs.¹    or.²   %plus³  gen.⁴
AT-AC         5 975    134   (100)   s         AAAG-AAGG     5 659    870    100    s
AC-AG         5 456    173     28    s         AC-AG         5 628    169     31    s
AAAG-AAGG     5 149    844    100    s         AT-AC         5 205    173   (100)   s
A-AAAG        4 401     37    100    s         A-AAAG        4 481     32    100    s
AAGG-AGGG     4 325   2265    100    s         AAGG-AGGG     4 456   2311    100    s
A-AT          4 234     25   (100)   s         A-AT          3 505     26   (100)   s
A-AAAAG       3 263     50    100    s         A-AAAAG       3 296     42    100    s
AT-AG         2 025    133   (100)   s         AG-AAAG       2 582    222    100    s
AG-AAAG       1 750    161    100    s         AT-AG         1 618    146   (100)   s
AAAT-AAAAT    1 106     58     99    s         A-AG          1 547     11     95    s

M. musculus                                    R. norvegicus
motif        obs.¹    or.²   %plus³  gen.⁴     motif        obs.¹    or.²   %plus³  gen.⁴
AC-AG        38 006     94     48    s         AC-AG        42 254    103     50    s
AAAG-AAGG    15 941    943    100    s         AT-AC         7 963     48   (100)   s
AT-AC        11 459     69   (100)   s         AAAG-AAGG     6 248   1000    100    s
AAG-AGG       9 439   1983    100    s         AAG-AGG       4 662   1962    100    s
AAGG-AGGG     8 829    913    100    s         AC-ACAG       4 107     50     95    s
AG-AAAG       8 350    129    100    s         AG-AGGG       3 993    184    100    s
AG-AGGG       7 645    206    100    s         AG-ACAG       3 372    110     99    s
AAAC-AAAAC    3 877     59    100    s         AC-CG         3 013    308   (100)   s
AG-AAGG       3 763     83    100    ?         AT-AG         2 654     43   (100)   s
A-AAAT        3 623     37     98    s         AC-ACGC       2 554    168     99    s

O. anatinus                                    G. gallus
motif        obs.¹    or.²   %plus³  gen.⁴     motif        obs.¹    or.²   %plus³  gen.⁴
AC-AG           476    267      4    s         A-AAAG          530     48     99    s
AT-AC           175    111   (100)   s         AAAC-AAAAC      412     74    100    s
AAT-ATC         113     11     14    s         AAAG-AAGG       341   1209    100    s
AT-AG            79     87   (100)   s         AT-AC           309    173   (100)   s
AAT-AATG         76      1     37    c         A-AC            293     21     98    s
AAT-AAT          71      1     (0)   s         AAC-AAAC        266     72     99    s
AATG-ACTG        65     38     98    s         A-AAAC          260      6     95    s
AATG-ATCC        37     79      0    s         AAGG-AGGG       254   5492    100    s
AATC-AATG        31      3     26    c/s       A-AAAAG         228     45     99    s
AG-AAAG          31    301    100    s         A-AAG           223     95    100    s

D. rerio                                       D. melanogaster
motif        obs.¹    or.²   %plus³  gen.⁴     motif        obs.¹    or.²   %plus³  gen.⁴
AT-AC        21 990     63   (100)   s         AAC-AGC          45     53    100    s
A-AT         11 172     48   (100)   s         A-AAT            23     20     57    s
ATAG-ACAG    10 370   1516    100    s         AT-AC            18      5   (100)   s
ATAG-ATCC     6 503    497      0    s         AT-ATAC          17     25   (100)   s
AAT-AAT       5 910     38     (0)   r/s       ATC-AGC          15     29     93    s
AT-ATAC       4 587    230   (100)   s         ACC-AGC          12     42    100    s
AC-AG         3 830     49     26    s         AAT-AAAT         12     68    100    s
AAT-ACT       3 685    316     84    s         AGC-AGG           8     29     88    s
AAT-AAC       3 624    204     91    s         AGC-AACAGC        7     69    100    s
AT-AAAT       2 973     17   (100)   s         AT-AAT            7     11   (100)   s

¹ observed number of SSR-couples having the given motif
² overrepresentation
³ percent of the SSR-couples found in the plus-conformation (see text). Values in brackets indicate that only the specified conformation is feasible (e.g. SSR-couples containing self-complementary microsatellites)
⁴ suggested genesis of the SSR-couple: c: chance; r: recombination; s: slippage; ?: unknown

While this is obvious for SSR-couples with motifs like AAGG-AGGG, SSR-couples with motifs like AG-AAAG might require further explanation. The SSR-couple AG-AAAG could in fact also be depicted as AGAG-AAAG, which illustrates
how only one base substitution is required to transform the repeat motif AG into the motif AAAG. In another example, SSR-couples with the motif ATAG-ATCC in D. rerio are only found in the minus conformation. The two individual microsatellite motifs of the plus conformation, ATAG and ATCC, differ by two base substitutions, whereas the two motifs of the minus conformation, ATAG and ATGG, differ by only a single base substitution.

Table 5: Characteristics and probable genesis of the most abundant SSR-couples in the cds

H. sapiens                                       M. mulatta
motif          obs.¹   or.²    %plus³  gen.⁴     motif          obs.¹   or.²    %plus³  gen.⁴
AGC-CCG         20        74     20    s         AAC-AGC         12     2 244    100    s
AAC-AGC         18     1 913    100    s         AGC-CCG          8        61     25    s
AAG-AGG         10       133    100    s         AAG-AGG          7       160    100    s
AGG-CCG          9        38     22    s         AAAG-AAGG        5    > 10^4    100    s
AAG-ATC          6       428      0    s         ACC-CCG          4       134    100    -
ACC-CCG          5        73     80    s         AGC-AGCTCC       3       367    100    -
AGCCTG-AGGCCC    4    > 10^4      0    -         AGG-AAGAGG       3       508    100    -
AGC-AGCCTG       4     2 381      0    -         A-AAG            3       122    100    -
AGC-AGG          4        12    100    -         AGC-AGG          3        15    100    -
ACG-AGG          3       419    100    -         AGG-CCG          2        19      0    -

M. musculus                                      R. norvegicus
motif          obs.¹   or.²    %plus³  gen.⁴     motif          obs.¹   or.²    %plus³  gen.⁴
AAG-AGG         13       210    100    s         AACC-ATCC       16    > 10^4    100    s
AAC-AGC         10       751    100    s         AT-AC           12     2 473   (100)   s
AC-AG            7     5 655     43    s         AAG-AGG         12       353    100    s
CCG-AGCCGG       6     2 937    100    s/?       AAAG-AAGG        9    > 10^4    100    s
AGC-AGGCCC       6       732    100    ?         AG-AAAG          9      3520    100    s
ACC-CCG          5       121    100    s         AC-AG            7       481     86    s
AAAG-AAGG        5    > 10^4    100    s         CCG-AGCCGG       5     4 828    100    s/?
AGC-CCG          4        25      0    -         AGG-CCG          4        86      0    -
AGG-CCG          3        23     67    -         AG-AAGG          4     2 347    100    -
AAG-AAAAG        2     1 159    100    -         AG-ACAG          4     9 387    100    -

O. anatinus                                      G. gallus
motif          obs.¹   or.²    %plus³  gen.⁴     motif          obs.¹   or.²    %plus³  gen.⁴
AAC-AGC          2     4 265    100    -         AAAG-AAGG        5    > 10^4    100    s
AGC-AATG         2       262    100    -         ACG-AGC          4     1 260    100    -
ACG-AGG          2       319    100    -         A-AAAG           4     2 605    100    -
ACT-AGG          2     3 828      0    -         ACC-AGG          3       121      0    -
AC-AG            1     1 866      0    -         AAG-AGG          3       107    100    -
AATG-AAGG        1     3 445    100    -         AAGG-AGGG        2    > 10^4    100    -
AGC-ACACC        1     2 843    100    -         CCG-CCGCG        2     2 085    100    -
AG-AAAG          1     7 464    100    -         AGC-CCG          1        21      0    -
AAC-ACACC        1    > 10^4    100    -         ACCGC-AGCGG      1    > 10^4      0    -
ATC-ACG          1     1 464      0    -         AGC-AGG          1        12    100    -

D. rerio                                         D. melanogaster
motif          obs.¹   or.²    %plus³  gen.⁴     motif          obs.¹   or.²    %plus³  gen.⁴
AAC-AGC         12       788    100    s         AAC-AGC         36        62    100    s
AAT-AAAT         9     4 273    100    s         AGC-CCG          8        40     75    s
AACC-ATCC        6    > 10^4    100    s         ACC-AGC          7        19    100    s
AC-AC            6        41     (0)   r/?       AGC-AGG          5        13     80    s
ATCC-ACGG        6    > 10^4      0    s         AAT-AAC          4       315    100    -
ATC-ACG          4     5 622      0    -         AAC-ATC          4       140    100    -
AAG-ATC          4       113      0    -         ATC-AGC          4        24    100    -
ATC-AGG          4        58      0    -         ACG-AGG          3       240    100    -
AAT-ACT          3     9 081    100    -         AGC-AACAGC       3        31    100    -
ACC-AGC          3       126      0    -         AAC-ACC          3        47    100    -

¹ observed number of SSR-couples having the given motif
² overrepresentation
³ percent of the SSR-couples found in the plus-conformation (see text). Values in brackets indicate that only the specified conformation is feasible (e.g. SSR-couples containing self-complementary microsatellites)
⁴ suggested genesis of the SSR-couple: c: chance; r: recombination; s: slippage; ?: unknown

These ATAG-ATCC SSR-couples are only found in the conformation which requires the
fewest base substitutions to transform one motif into the other, i.e. the minus conformation. SSR-couples with the motif AC-AG provide particularly interesting insight into the origin of compound microsatellites, since the individual microsatellite motifs differ by only a single base substitution in both the plus and the minus conformation (plus: AC ⇌ AG; minus: AC ⇌ TC). Interestingly, both conformations can be found in all examined species with relatively equal frequencies (balanced conformation, Table 4). Overall, we found that almost all SSR-couples contain two cSSRs with highly similar motifs. These motifs typically require only a single base substitution for transformation into the other motif. This suggests that most of the cSSRs forming a compound microsatellite are derived from a preexisting microsatellite.

3 Discussion

We present the first comprehensive survey of compound microsatellites in eight fully sequenced eukaryote genomes. The most influential parameter on the number of identified compound microsatellites is the maximum distance (dmax) between two adjacent microsatellites. If microsatellites were randomly distributed, a linear increase of cSSR frequency with dmax would be expected. Instead, we observed that two microsatellites are more likely to lie in close proximity than this assumption predicts. We note, however, that defining the optimal dmax is somewhat complicated for microsatellites carrying imperfections. Because the SSR-search is partially incomplete and does not always identify the whole microsatellite tract, neighboring microsatellites might not be recognized as a compound microsatellite. Therefore, the choice of dmax should allow for a certain degree of inaccuracy in the SSR-search and at the same time provide the maximum sensitivity for the identification of compound microsatellites. We account for this uncertainty by allowing for mismatches in the SSR-search and by using a dmax of 10 bp.
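The role of dmax can be illustrated with a minimal Python sketch. It is not the software used in this survey; the `SSR` record, the function name `group_compound` and the half-open coordinate convention are assumptions made for the example. Microsatellites are sorted by start position and chained into one compound microsatellite whenever the gap to the preceding microsatellite is at most dmax.

```python
from typing import List, NamedTuple


class SSR(NamedTuple):
    """Hypothetical record for one microsatellite (half-open coordinates)."""
    start: int  # first base of the tract (0-based, inclusive)
    end: int    # position after the last base of the tract (exclusive)
    motif: str


def group_compound(ssrs: List[SSR], dmax: int = 10) -> List[List[SSR]]:
    """Chain neighbouring microsatellites whose gap is at most dmax bp.

    Returns only chains with two or more members, i.e. the candidate
    compound microsatellites. Overlapping tracts (negative gap) are
    chained as well, which is a simplification of this sketch.
    """
    clusters: List[List[SSR]] = []
    for ssr in sorted(ssrs, key=lambda s: s.start):
        if clusters and ssr.start - clusters[-1][-1].end <= dmax:
            clusters[-1].append(ssr)   # close enough: extend current chain
        else:
            clusters.append([ssr])     # too far away: start a new chain
    return [c for c in clusters if len(c) >= 2]


if __name__ == "__main__":
    example = [SSR(0, 16, "AC"), SSR(20, 38, "AG"), SSR(100, 120, "AT")]
    for cluster in group_compound(example, dmax=10):
        print("-".join(s.motif for s in cluster))
```

With the example coordinates, the [AC] and [AG] tracts are separated by 4 bp and are grouped at dmax = 10, whereas a stricter dmax of 3 would report no compound microsatellite at all. This is the sensitivity trade-off described above.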
3.1 Microsatellite clusters: frequency and general features

To our knowledge, the only estimate of compound microsatellite frequency was published by Weber [10], who estimated that about 10% of all H. sapiens microsatellites have a compound motif. Given the limited amount of sequence information available at that time, this estimate corresponds remarkably well with our results based on the complete genome. In H. sapiens, about 11% of all microsatellites are part of a compound microsatellite (Table 1). The large majority of these compound microsatellites are located in intergenic regions. The distribution of compound microsatellites in H. sapiens is fairly homogeneous throughout all chromosomes, i.e. no clustering at the telomeres or around the centromeres could be observed (Fig. 2). Compound microsatellites are 4- to 23-fold overrepresented in the whole genomes of eight fully sequenced species (Table 3), which is highly significant (Poisson distribution: P < 0.001). Bachtrog et al. [23] reported similar results in an analysis of 13 Mbp of the D. melanogaster genome: microsatellites tend to aggregate and deviate significantly from a random distribution within the investigated sequence. Interestingly, despite their rare occurrence, compound microsatellites are most overrepresented in the cds (Table 3), which may indicate that these compound microsatellites are conserved because of an involvement in cellular processes. A recent review by Kashi and King [24], for example, suggested that compound microsatellites might be involved in the regulation of avpr1a, which influences social behaviour in voles. In the cds, however, most SSR-couples contain microsatellites with motifs of three or six base pairs (Table 5). This is not surprising, as slippage events in these microsatellites do not cause a shift in the reading frame [25].
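The Poisson-based significance of an overrepresentation can be sketched as follows. This is an illustration under assumptions, not the authors' script: the function name is hypothetical, the test is taken to be the upper tail P(X ≥ obs.) of a Poisson distribution with mean exp., and the result is returned as log10 because the probabilities in Table 3 are too small for ordinary floating-point numbers.

```python
import math


def log10_poisson_upper_tail(observed: int, expected: float) -> float:
    """Return log10 of P(X >= observed) for X ~ Poisson(expected).

    Sketch for the overrepresented case only (observed > expected > 0),
    where the tail series converges quickly.
    """
    if not (expected > 0 and observed > expected):
        raise ValueError("sketch only covers observed > expected > 0")
    # Natural log of the first tail term, P(X = observed).
    log_first = (-expected + observed * math.log(expected)
                 - math.lgamma(observed + 1))
    # Sum the remaining terms relative to the first one; each ratio
    # expected / (observed + i) is below 1, so the terms shrink.
    rel_sum, rel_term, i = 1.0, 1.0, 1
    while rel_term > 1e-17 * rel_sum:
        rel_term *= expected / (observed + i)
        rel_sum += rel_term
        i += 1
    return (log_first + math.log(rel_sum)) / math.log(10)


if __name__ == "__main__":
    # Whole-genome counts for H. sapiens as listed in Table 3.
    print(log10_poisson_upper_tail(69670, 4488))
```

Because Table 3 lists rounded expected counts, values computed from the table can only approximate the published P values. For the H. sapiens whole-genome counts the result lies far below -99, in line with footnote 5 of Table 3.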
Three main parameters governing compound microsatellite density can be identified: 'species', 'chromosome' and the overall 'SSR-density'. These three parameters are highly correlated with compound microsatellite density (R² = 0.94). The parameters with the most significant influence are 'chromosome' and 'species', accounting for 38% and 35% of the observed variation in compound microsatellite density, respectively. We hypothesize that the rate of base substitutions and the efficiency of the mismatch repair system are responsible for the high influence of the species, since these processes have been identified as crucial for the evolution and stability of microsatellites in general [1-3]. The significant differences in compound microsatellite density between chromosomes (CatReg: p < 0.001) were not expected; we can only speculate about the processes which might be responsible for these differences.

3.2 Genesis of compound microsatellites: Recombination

Jakupciak and Wells [22] showed that 'illegitimate' recombination involving an inversion between two homologous microsatellites may create compound microsatellites consisting of two microsatellites with complementary motifs such as [CTG]13 [CAG]67. Assuming that compound microsatellites predominantly originate through the process described by Jakupciak and Wells [22] and further assuming that 'illegitimate' recombination rates are positively correlated with normal recombination rates, the Y chromosomes ought to have
significantly fewer compound microsatellites than the autosomes. This was not confirmed by our results, which suggest that recombination does not have a significant influence on compound microsatellite density (CatReg: p = 0.214 and linear correlation: R² = 0.03). Moreover, SSR-couples created by recombination should exhibit a distinctive pattern: (i) they should be overrepresented compared to a random distribution of microsatellites in the genomes, (ii) they should only be found in the minus-conformation, (iii) the motifs of the two microsatellites forming an SSR-couple should have identical length (e.g. AC-AG), and (iv) these two motifs should be mutually complementary (summary in Table 6; abbr.: 'r'). Table 4 demonstrates that only very few SSR-couples show this pattern. We therefore suggest that SSR-couples formed by 'illegitimate' recombination are rare and that most SSR-couples (and thus compound microsatellites) are created by processes other than recombination.

3.3 Genesis of compound microsatellites: Random events

The highly significant overrepresentation of SSR-couples (Table 3) indicates that only a minor fraction of the compound microsatellites can be attributed to a coincidental emergence of a microsatellite in the proximity of an already existing one. SSR-couples formed by chance should also show a distinctive pattern: (i) they should not be overrepresented, (ii) they should have a balanced conformation (about 50% plus and 50% minus conformation) and (iii, iv) the motifs of the individual microsatellites forming these SSR-couples need not be similar in length or sequence (summary in Table 6; abbr.: 'c'). A high overrepresentation and an unbalanced conformation are strong indications that the respective SSR-couples are not a product of chance. Table 4 shows that only the SSR-couples having the motif AAT-AATG in O. anatinus exhibit both a low overrepresentation and a relatively balanced conformation.
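The conformation and motif-similarity criteria used in the two preceding sections can be checked mechanically for motifs of equal length. The Python sketch below is an illustration, not the procedure used in this survey. All function names are hypothetical, and it rests on two assumptions: motifs are compared over all cyclic rotations, and only base substitutions (no indels) are counted. The plus conformation compares the two motifs directly; the minus conformation compares the first motif with the reverse complement of the second.

```python
from typing import Dict, List

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(motif: str) -> str:
    """Reverse complement of a DNA motif written with A, C, G, T."""
    return motif.translate(_COMPLEMENT)[::-1]


def rotations(motif: str) -> List[str]:
    """All cyclic rotations of a motif (e.g. AG -> AG, GA)."""
    return [motif[i:] + motif[:i] for i in range(len(motif))]


def min_substitutions(a: str, b: str) -> int:
    """Fewest base substitutions between a and any rotation of b."""
    if len(a) != len(b):
        raise ValueError("sketch only handles motifs of equal length")
    return min(sum(x != y for x, y in zip(a, rot)) for rot in rotations(b))


def conformation_distances(a: str, b: str) -> Dict[str, int]:
    """Substitutions needed in the plus and in the minus conformation."""
    return {
        "plus": min_substitutions(a, b),
        "minus": min_substitutions(a, reverse_complement(b)),
    }


if __name__ == "__main__":
    for pair in (("ATAG", "ATCC"), ("AC", "AG"), ("CTG", "CAG")):
        print(pair, conformation_distances(*pair))
```

For ATAG-ATCC the sketch yields two substitutions in the plus and one in the minus conformation, and for AC-AG one substitution in either conformation, matching the examples given in the Results. A minus distance of zero, as for CTG and CAG, marks mutually complementary motifs, which is the signature expected for recombination.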
Therefore, our results suggest that the majority of the SSR-couples cannot be attributed to a coincidental emergence of a microsatellite in the proximity of an already existing one.

3.4 Genesis of compound microsatellites: Imperfections within microsatellites

We found that the graphs of the microsatellite and compound microsatellite density have a highly similar overall shape (Fig. 2) and that the SSR-density is significantly correlated with the compound microsatellite density (CatReg: p < 0.001). Three scenarios for this high interdependence between microsatellite and compound microsatellite density are possible in theory. First, recombination between homologous microsatellites might lead to elevated compound microsatellite densities in genomic regions having a high SSR density. Second, an increased SSR density might increase the frequency of adjacent SSRs due to chance. Third, imperfections in the tract of microsatellites may be the origin of compound microsatellites [26-29]. Since we have already excluded the first two scenarios, the hypothesis that imperfections within microsatellites give rise to compound microsatellites remains the most probable explanation. Possible molecular mechanisms explaining how imperfections within microsatellites may generate compound microsatellites have already been discussed [27, 28]. Basically, a mutation within a microsatellite generates an imperfect motif repeat which may be duplicated tandemly by replication slippage [27-29], thus generating a 'proto' compound microsatellite. This 'proto' compound microsatellite consists of a long and a short microsatellite, the latter of which may have as few as two adjacent repeat units. Two motif repeats are already sufficient for independent expansion of the microsatellite by replication slippage or indel-like events [30, 31]. After adequate expansion of the short microsatellite, the primary and the secondary microsatellite together will be regarded as a compound microsatellite.
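The proposed two-step genesis can be made concrete with a small string-level sketch in Python. It is a toy illustration of the model described above, not a simulation used in this study. The function name and its parameters are hypothetical, the point mutation is placed in the last repeat unit (reflecting the proposal that the process mostly occurs at the end of a microsatellite), and replication slippage is modelled simply as tandem duplication of the imperfect unit.

```python
def proto_compound(motif: str, repeats: int, position: int,
                   new_base: str, extra_copies: int = 1) -> str:
    """Build a toy 'proto' compound microsatellite.

    Step 1: a base substitution turns the last repeat unit of a perfect
            tract into an imperfect unit.
    Step 2: slippage duplicates the imperfect unit `extra_copies` times,
            creating a short secondary microsatellite next to the
            primary one.
    """
    if repeats < 1 or not 0 <= position < len(motif):
        raise ValueError("need at least one repeat and a valid position")
    units = [motif] * repeats
    mutated = motif[:position] + new_base + motif[position + 1:]
    units[-1] = mutated                      # step 1: point mutation
    units.extend([mutated] * extra_copies)   # step 2: tandem duplication
    return "".join(units)


if __name__ == "__main__":
    # [AC]8 with a C -> G substitution in the last unit, duplicated once.
    print(proto_compound("AC", 8, 1, "G"))
```

Starting from [AC]8, a single C to G substitution followed by one duplication gives [AC]7 [AG]2, i.e. a long primary and a short secondary microsatellite with two repeat units. Larger values of `extra_copies` mimic the subsequent expansion of the secondary microsatellite.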
However, replication slippage events involving the imperfect motif repeat may also span several motif repeats, in which case the motifs of the primary and the secondary microsatellite will have a stepwise length difference (e.g. AC-AGAC, AC-AGACAC, A-AAAG). The SSR-couples generated by the duplication of imperfect motif repeats should have a distinctive pattern: (i) they should be highly overrepresented, since a single mutation followed by a slippage event is sufficient for the formation of the proto compound microsatellite; (ii) these SSR-couples should mostly be found in one conformation, either plus or minus; (iii) the motif lengths of the primary and the secondary microsatellite should either be equal or differ in a stepwise manner; and (iv) the motifs of the primary and the secondary microsatellite should be similar, mostly differing by only a single mutation (summary in Table 6; abbr.: 's').

Table 6: Overview of the recognition pattern of different mechanisms potentially generating SSR-couples

proposed origin      overrepresentation   conformation          motif length              motif similarity
chance (c)           none (low)           balanced              none required             none required
recombination (r)    medium               unbalanced (minus)    equal                     reverse complement
slippage (s)         high                 unbalanced            equal (stepwise equal)    high

The majority of the SSR-