ww.nature.com/scientificreports/ Sequence analysis pipeline. A total of 41, 123, 342 paired-end reads were obtained and their sequencing quality assessed by FastQC. The majority of the sequence processing was performed based on the Schloss MiSeq standard operating procedure using Mothur. Briefly, paired sequences were merged into contigs followed by exclusion of sequences that had ambiguous base pairs, homopolymers longer than eight, or a read length greate than 275bp Remaining sequences were trimmed using a 10bp sliding window with an average quality score of 28 and a minimal trimmed length of 200bp Remaining high quality sequences were aligned to SILVA v119 . PCR errors were reduced using pre cluster, and chimeras were detected and removed by UCHIME. Subsequently, the high quality sequences were classified using Mothur's implementation of the RDP classifier against the reference ning set v14. Sequences classified as Mitochondria, Chloroplasts, Archaea, Eukaryotes, or unclassified at the kingdom level were excluded. Remaining sequences were clustered at a distance of 0.03(97% sequence similarity) using average neighbor clustering algorithm. After classifying the consensus taxonomy for each OTU, singletons were removed. To account for the presence of contaminating microbial 16S rDNA sequences in extraction kits d laboratory reagents, we removed all OTUs detected from control empty Eppendorf tubes. A final count of 15 NAF samples, along with 13 nipple skin, and 12 post-Betadine skin samples survived strict quality processing and sequent background subtraction. The 15 NAF samples, nine from HC and six from BC, had a total of 31,968 eads distributed among 143 OTUs with a table density of 0. 102. The 13 skin samples, eight from HC and five from BC, had a total of 31, 679 reads distributed among 127 OTUs with a table density of 0.108. The post-Betadine skin samples, five from HC and seven from BC, had a total of 26, 172 reads distributed among 119 OTUs with a table density of 0.114. Clinical information of the 12 patients whose NAF samples passed quality filtering and background subtraction is shown in Table 2. The table for the number of NAF samples attempted, collected, amplified, and passed quality filtering across clinical information are shown in Supplemental Table S6 Statistical analysis. To account for any bias caused by uneven sequencing depth, the least number of sequences present in any given sample from a sample category was selected randomly prior to calculating community-wide dissimilarity measures(rarefaction). 918 random sequences were chosen for the nipple skin analysis, comparing eight skin samples from HC vS. five from BC. 1164 random sequences were chosen for the analysis of the surgical scrubbed (Betadine)nipple skin microbiome, which compared five post-Betadine skin swabs from HC vS. seven from BC. 934 random sequences were chosen for NAF analysis, comparing nine NAF from HC vs six from BC. Finally, 918 random sequences were chosen for the comparison between the nipple skin and the NAF samples, which had nine pairs collected from the same breasts: six from HC and three from BC. All Principal Coordinate Analyses(PCoA)were based on a Bray-Curtis dissimilarity using evenly sampled (rarefied) OTU abundances. To test for community compositional differences, we used analysis of variance(Adonis)from R's vegan package s, which also noted that Adonis test is less sensitive to dispersion effects and is a more robust alternative to either analysis of similarities(ANOSIM) or multi-response permutation procedures(MRPP). To compare community-wide differences between paired nipple skin and NAF samples, the strata parameter wa applied to restrict permutations within the patient variables and not across. To test for differentially abundant TUs between HC and BC samples, we employed a nonparametric Kruskal-Wallis test. As for comparing the OTU abundances between paired NAF and nipple skin, we instead used a paired wilcoxon signed-rank test. The alpha-level cutoff was 0.05 and false discovery correction was not applied for comparing the OTU relative abundances Observed OTU metric was used for alpha diversity calculations with ten permutations of random sampling at each sequencing depth. Taking the observed OTUs at the final sequencing depth, we performed nonparametric t-test using Monte Carlo permutations to compare the diversity between samples from HC vS BC. As for comparing the bacterial diversity between paired NAF and nipple skin samples, we instead used a paired two-tailed t-test. All statistical analyses were performed in R 3. 2.3 Functional Prediction with PICRUSt. Following the Mothur pipeline, the high quality filtered OTUs were aligned again, but this time against a closed reference greengenes database(v13_8). The analysis is limited to bacteria whose genome, and therefore all of their genes and gene contents, are known. Due to the closed reference OTU picking, only 12 NAF samples, six from HC and six from BC, had sufficient 16S reads remaining. Next, we used PICRUSt Package to computationally predict the gene contents from the 16S rRNA region, and subse quently, the pathways related to these genes". Using PICRUSt, the picked OTUs were normalized to their copy PICRUSt identified a total of 328 KEGG pathways and 6909 bacterial genes, of which we pre-selected 18 pathways to reduce false discovery rate from multiple hypothesis testing: the pathways were selected based on previ nutritional and microbiome studies in relation to colon cancer-. The selected 18 pathways were normalized to the sample with the lowest abundance, which was at a sampling depth of 26359. The predicted KEGG pathways between NAF from HC vS BC were compared using Kruskal-Wallis test with Benjamini-Hochberg correction References 1. Kuper, H, Adami, H. O& Trichopoulos, D Infections as a major preventable cause of human cancer. Journal of Internal Medicine 248,171-183,doi:1 1365-27962000.00742x(2000) el. C. ef al. 3. Kundu, J K& Surh, Y J Inflammation: Gearing the journey to cancer. Mutation Research/Reviews in Mutation Research 659, 15-30, doi:10.1016/ nerev.2008.03.002(2008) clinic.2015105(2015 5. Abreu, M. T. Peek Jr, R. M. Gastrointestinal Malignancy and the Microbiome Gastroenterology 146, 1534-1546, doi: 10. 1053/ gastro201401.001(2014) SCIENTIFIC REPORTS 6: 28061 DO1: 10.1038/srep28061www.nature.com/scientificreports/ Scientific Reports | 6:28061 | DOI: 10.1038/srep28061 9 Sequence analysis pipeline. A total of 41,123,342 paired-end reads were obtained and their sequencing quality assessed by FastQC50. The majority of the sequence processing was performed based on the Schloss MiSeq standard operating procedure using Mothur51. Briefly, paired sequences were merged into contigs followed by exclusion of sequences that had ambiguous base pairs, homopolymers longer than eight, or a read length greater than 275bp. Remaining sequences were trimmed using a 10bp sliding window with an average quality score of 28 and a minimal trimmed length of 200bp. Remaining high quality sequences were aligned to SILVA v11952. PCR errors were reduced using pre.cluster, and chimeras were detected and removed by UCHIME53. Subsequently, the high quality sequences were classified using Mothur’s implementation of the RDP classifier against the reference training set v1454. Sequences classified as Mitochondria, Chloroplasts, Archaea, Eukaryotes, or unclassified at the kingdom level were excluded. Remaining sequences were clustered at a distance of 0.03 (97% sequence similarity) using average neighbor clustering algorithm. After classifying the consensus taxonomy for each OTU, singletons were removed. To account for the presence of contaminating microbial 16S rDNA sequences in extraction kits and laboratory reagents, we removed all OTUs detected from control empty Eppendorf tubes. A final count of 15 NAF samples, along with 13 nipple skin, and 12 post-Betadine skin samples survived strict quality processing and subsequent background subtraction. The 15 NAF samples, nine from HC and six from BC, had a total of 31,968 reads distributed among 143 OTUs with a table density of 0.102. The 13 skin samples, eight from HC and five from BC, had a total of 31,679 reads distributed among 127 OTUs with a table density of 0.108. The post-Betadine skin samples, five from HC and seven from BC, had a total of 26,172 reads distributed among 119 OTUs with a table density of 0.114. Clinical information of the 12 patients whose NAF samples passed quality filtering and background subtraction is shown in Table 2. The table for the number of NAF samples attempted, collected, amplified, and passed quality filtering across clinical information are shown in Supplemental Table S6. Statistical analysis. To account for any bias caused by uneven sequencing depth, the least number of sequences present in any given sample from a sample category was selected randomly prior to calculating community-wide dissimilarity measures (rarefaction). 918 random sequences were chosen for the nipple skin analysis, comparing eight skin samples from HC vs. five from BC. 1164 random sequences were chosen for the analysis of the surgical scrubbed (Betadine) nipple skin microbiome, which compared five post-Betadine skin swabs from HC vs. seven from BC. 934 random sequences were chosen for NAF analysis, comparing nine NAF from HC vs. six from BC. Finally, 918 random sequences were chosen for the comparison between the nipple skin and the NAF samples, which had nine pairs collected from the same breasts: six from HC and three from BC. All Principal Coordinate Analyses (PCoA) were based on a Bray-Curtis dissimilarity using evenly sampled (rarefied) OTU abundances. To test for community compositional differences, we used analysis of variance (Adonis) from R’s vegan package55, which also noted that Adonis test is less sensitive to dispersion effects and is a more robust alternative to either analysis of similarities (ANOSIM) or multi-response permutation procedures (MRPP). To compare community-wide differences between paired nipple skin and NAF samples, the strata parameter was applied to restrict permutations within the patient variables and not across55. To test for differentially abundant OTUs between HC and BC samples, we employed a nonparametric Kruskal-Wallis test. As for comparing the OTU abundances between paired NAF and nipple skin, we instead used a paired Wilcoxon signed-rank test. The alpha-level cutoff was 0.05 and false discovery correction was not applied for comparing the OTU relative abundances. Observed OTU metric was used for alpha diversity calculations with ten permutations of random sampling at each sequencing depth. Taking the observed OTUs at the final sequencing depth, we performed a nonparametric t-test using Monte Carlo permutations to compare the diversity between samples from HC vs. BC. As for comparing the bacterial diversity between paired NAF and nipple skin samples, we instead used a paired two-tailed t-test. All statistical analyses were performed in R 3.2.356. Functional Prediction with PICRUSt. Following the Mothur pipeline, the high quality filtered OTUs were aligned again, but this time against a closed reference Greengenes database (v13_8)57. The analysis is limited to bacteria whose genome, and therefore all of their genes and gene contents, are known. Due to the closed reference OTU picking, only 12 NAF samples, six from HC and six from BC, had sufficient 16S reads remaining. Next, we used PICRUSt package to computationally predict the gene contents from the 16S rRNA region, and subsequently, the pathways related to these genes28. Using PICRUSt, the picked OTUs were normalized to their copy numbers and their pathway abundances inferred from the Kyoto Encyclopedia of Genes and Genomes (KEGG)58. PICRUSt identified a total of 328 KEGG pathways and 6909 bacterial genes, of which we pre-selected 18 pathways to reduce false discovery rate from multiple hypothesis testing; the pathways were selected based on previous nutritional and microbiome studies in relation to colon cancer20–27. The selected 18 pathways were normalized to the sample with the lowest abundance, which was at a sampling depth of 26359. The predicted KEGG pathways between NAF from HC vs. BC were compared using Kruskal-Wallis test with Benjamini-Hochberg correction. References 1. Kuper, H., Adami, H. O. & Trichopoulos, D. Infections as a major preventable cause of human cancer. Journal of Internal Medicine 248, 171–183, doi: 10.1046/j.1365-2796.2000.00742.x (2000). 2. de Martel, C. et al. Global burden of cancers attributable to infections in 2008: a review and synthetic analysis. The Lancet Oncology 13, 607–615, doi: 10.1016/S1470-2045(12)70137-7 (2012). 3. Kundu, J. K. & Surh, Y. J. Inflammation: Gearing the journey to cancer. Mutation Research/Reviews in Mutation Research 659, 15–30, doi: 10.1016/j.mrrev.2008.03.002 (2008). 4. Crusz, S. M. & Balkwill, F. R. Inflammation and cancer: advances and new agents. Nat Rev Clin Oncol 12, 584–596, doi: 10.1038/ nrclinonc.2015.105 (2015). 5. Abreu, M. T. & Peek Jr., R. M. Gastrointestinal Malignancy and the Microbiome. Gastroenterology 146, 1534–1546, doi: 10.1053/j. gastro.2014.01.001 (2014)